
A Summary of Character Encoding Issues in Python 2

Author: forthxu
Date: December 21, 2016
Category: Default

Python docs - Unicode HOWTO

Python docs - Built-in Types

Stack Overflow - Why does Python print unicode characters when the default encoding is ASCII?

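Theory

Unicode and UTF-8

Unicode is a character set. UTF-8 is one encoding of Unicode; UTF-16 and UTF-32 are others.

A character's Unicode code point is looked up in the standard, and its UTF-8 encoding is computed from that code point.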
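
The "computed from the code point" step can be sketched by hand. The snippet below is a minimal illustration of the UTF-8 bit layout, written for Python 3 (an assumption, since the rest of this post targets Python 2). `utf8_encode` is a hypothetical helper name, and it does not reject surrogates or out-of-range values.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes by hand."""
    if code_point < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+554A is the first character of the sample string used in this post.
encoded = utf8_encode(0x554A)
print(encoded)                               # b'\xe5\x95\x8a'
print(encoded == "\u554a".encode("utf-8"))   # True
```

The three bytes match the first three bytes of the str shown in the next section.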

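str and unicode in Python 2

str and unicode are two different classes.

A str holds an already-encoded sequence of bytes. Its repr shows each non-ASCII byte in hexadecimal with a \x prefix. In UTF-8, each of the Chinese characters below takes 3 bytes.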

>>> a = '啊哈哈'
>>> type(a)
<type 'str'>
>>> a
'\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'
>>> len(a)
9
>>> a[2]
'\x8a'

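A unicode object is a string of characters: it stores the text before encoding. Its repr shows non-ASCII characters with a \u prefix, and each Chinese character counts as one unit of length. A unicode literal is written with a leading u.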

>>> b = u'哟呵呵'
>>> type(b)
<type 'unicode'>
>>> b
u'\u54df\u5475\u5475'
>>> len(b)
3
>>> b[2]
u'\u5475'

A str is converted to a unicode object with decode(); the argument names the encoding.

>>> a.decode('utf-8')
u'\u554a\u54c8\u54c8'

A unicode object is converted to a str with encode(); the argument names the encoding.

>>> b.encode('utf-8')
'\xe5\x93\x9f\xe5\x91\xb5\xe5\x91\xb5'
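
For comparison, here is a small sketch of the same round trip in Python 3, where `bytes` plays the role of Python 2's str and `str` the role of unicode. Using Python 3 is an assumption of this example, not something the post itself covers.

```python
text = "\u554a\u54c8\u54c8"      # text string (Python 2: unicode)
data = text.encode("utf-8")      # byte string (Python 2: str)

print(len(text))    # 3, counted in characters
print(len(data))    # 9, counted in bytes
print(data)         # b'\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'

# decode() reverses encode() as long as the same encoding is named.
round_trip = data.decode("utf-8")
print(round_trip == text)        # True

# Naming a codec that cannot represent the bytes raises an error.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print(type(exc).__name__)    # UnicodeDecodeError
```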

Default encodings

Python 2 has several distinct "default encoding" settings.

The coding declaration at the top of a source file

 # -*- coding: utf-8 -*-

or

 # coding=utf-8

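This declares the encoding of the source file itself and is needed when the file contains Chinese (or any other non-ASCII) characters.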

 % cat hello.py
 #! /usr/bin/env python
 # coding=utf-8
 print '泥壕'
 
 % python hello.py
 泥壕

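Without the declaration the source encoding defaults to ascii, and the interpreter rejects the file as soon as it meets a Chinese character.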

 % cat hello.py
 #! /usr/bin/env python
 print '泥壕'
 
 % python hello.py
     File "hello.py", line 2
 SyntaxError: Non-ASCII character '\xe6' in file hello.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
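
How the declaration is recognized can be sketched with the regular expression documented in PEP 263, which is applied to the first two lines of a file. The snippet is a Python 3 illustration; `declared_encoding` is a hypothetical helper name, and the real interpreter does more (for example, handling a UTF-8 BOM).

```python
import re

# Pattern documented in PEP 263 for the declaration on line 1 or 2.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source_lines):
    """Return the encoding named in the first two lines, or None."""
    for line in source_lines[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1)
    return None


print(declared_encoding(["#! /usr/bin/env python", "# coding=utf-8"]))   # utf-8
print(declared_encoding(["# -*- coding: utf-8 -*-"]))                    # utf-8
print(declared_encoding(["#! /usr/bin/env python", "print 'hi'"]))       # None
# A space before "=" does not match, so this line declares nothing.
print(declared_encoding(["# coding = utf-8"]))                           # None
```

The last case is easy to get wrong: the colon or equals sign must follow the word coding directly.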

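sys.stdin.encoding and sys.stdout.encoding

The encodings used for stdin and stdout, covering command-line arguments and print output. They are determined by the locale environment variables.

On an en_US.UTF-8 system the default is UTF-8.

sys.getdefaultencoding()

The encoding used for implicit conversions between str and unicode, for example when the two types are concatenated or a unicode object is written to a file.

The default is ascii.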

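String concatenation

When a unicode and a str are joined with +, the result is a unicode. Python first decodes the str into unicode, as if by decode(), and then concatenates. Because this implicit decode does not name an encoding, it can fail.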

>>> a = '啊哈哈'
>>> b = u'哟呵呵'
>>>
>>> a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
>>>
>>> a.decode('utf-8') + b
u'\u554a\u54c8\u54c8\u54df\u5475\u5475'

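The error says 'ascii' codec can't decode byte 0xe5. The str is decoded automatically with the default encoding reported by sys.getdefaultencoding(), which is ascii, and the byte 0xe5 lies outside the ASCII range, hence the error. Setting the default encoding to utf-8 lets the implicit decode succeed. sys.setdefaultencoding() is removed from the sys module at startup, so reload(sys) is needed to get it back.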

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>>
>>> a = '啊哈哈'
>>> b = u'哟呵呵'
>>> a + b
u'\u554a\u54c8\u54c8\u54df\u5475\u5475'
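
For contrast, a short Python 3 sketch (Python 3 is an assumption here; the post is about Python 2). Python 3 performs no implicit decode at all, so mixing the two types fails immediately and the explicit decode() is the only way to join them.

```python
a = "\u554a\u54c8\u54c8".encode("utf-8")   # bytes, like a Python 2 str
b = "\u54df\u5475\u5475"                   # text, like a Python 2 unicode

try:
    a + b
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("refused:", type(exc).__name__)   # refused: TypeError

# Decoding explicitly names the encoding, so nothing depends on a default.
joined = a.decode("utf-8") + b
print(joined == "\u554a\u54c8\u54c8\u54df\u5475\u5475")   # True
```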

Reading files and parsing JSON

Reading a file returns a str, whose repr shows non-ASCII bytes in \x hexadecimal form.

>>> f = open('t.txt')
>>> a = f.read()
>>> a
'{"hello":"\xe5\x92\xa9"}\n'

After JSON parsing, the strings come back as unicode.

>>> import json
>>> json.loads(a)
{u'hello': u'\u54a9'}
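
A self-contained Python 3 sketch of the same flow, under the assumption that the file holds UTF-8 (the temporary path and the file contents are made up for the example). Reading in binary mode gives raw bytes, and the decode step is spelled out before parsing.

```python
import json
import os
import tempfile

# Recreate a file with the same bytes as t.txt above.
path = os.path.join(tempfile.mkdtemp(), "t.txt")
with open(path, "wb") as f:
    f.write(b'{"hello":"\xe5\x92\xa9"}\n')

# Binary mode returns raw bytes, like f.read() in Python 2.
with open(path, "rb") as f:
    raw = f.read()

parsed = json.loads(raw.decode("utf-8"))
print(type(raw).__name__)               # bytes
print(parsed == {"hello": "\u54a9"})    # True
```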

Output

Writing to a file

A str can be written to a file directly; a unicode must first be encoded into a str.

>>> a = '啊哈哈'
>>> b = u'哟呵呵'
>>> a
'\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88'
>>> b
u'\u54df\u5475\u5475'
>>> 
>>> f = open('t.txt', 'w')
>>> f.write(a)
>>> f.write(b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> f.write(b.encode('utf-8'))

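Writing the unicode fails because the default encoding is ascii, so the implicit encode cannot succeed. Once the default encoding reported by sys.getdefaultencoding() is set to utf-8, the conversion happens automatically.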

>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf-8')
>>>
>>> f.write(b)
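
An alternative that avoids touching the global default is to let the file object do the encoding. The sketch below uses io.open, which exists in both Python 2.6+ and Python 3; it is written here in Python 3 syntax, and the temporary path is made up for the example.

```python
import io
import os
import tempfile

b = "\u54df\u5475\u5475"   # text; in Python 2 this would be a u'...' literal

path = os.path.join(tempfile.mkdtemp(), "t.txt")

# The file object encodes on write, so no process-wide setting changes.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(b)

with io.open(path, "rb") as f:
    stored = f.read()

print(stored == b"\xe5\x93\x9f\xe5\x91\xb5\xe5\x91\xb5")   # True
```

The bytes on disk are the same ones b.encode('utf-8') produced earlier.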

Computing MD5

Likewise, MD5 requires a unicode input to be encoded first.

>>> a = '啊哈哈'
>>> b = u'哟呵呵'
>>> import hashlib
>>> hashlib.md5(a).hexdigest()
'f38b302e2993ec3fdad79c4d76074b21'
>>> hashlib.md5(b).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> hashlib.md5(b.encode('utf-8')).hexdigest()
'c02dc06719bafeaf60505b11d3c0c90a'
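
The digest is computed over bytes, so it depends on which encoding is chosen. A minimal Python 3 sketch (an assumption; `md5_hex` is a hypothetical helper, not part of hashlib) makes that visible:

```python
import hashlib

text = "\u554a\u54c8\u54c8"


def md5_hex(value, encoding="utf-8"):
    """Hash text or bytes; text is encoded explicitly first."""
    if isinstance(value, str):
        value = value.encode(encoding)
    return hashlib.md5(value).hexdigest()


utf8_digest = md5_hex(text)
gbk_digest = md5_hex(text, encoding="gbk")

# Same text, same encoding, same bytes: the digests agree.
print(utf8_digest == md5_hex(b"\xe5\x95\x8a\xe5\x93\x88\xe5\x93\x88"))   # True
# Same text, different encoding, different bytes: the digests differ.
print(utf8_digest == gbk_digest)                                         # False
```

When digests are exchanged between systems, both sides therefore have to agree on the encoding as well as on the text.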

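Writing to stdout

When printing to stdout the encoding used is sys.stdout.encoding, whose default depends on the system's environment variables. That is why print can output Chinese characters without an explicit utf-8 encode.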

>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print u'\u54a9'
咩

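In a zh_CN.GB2312 environment, by contrast, the default is not utf-8 (here it fell back to ANSI_X3.4-1968, which is ASCII), and the output fails.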

>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
>>> print u'\u54a9'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u54a9' in position 0: ordinal not in range(128)
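
When the output encoding cannot represent a character, one option is to encode explicitly and choose an error handler instead of letting print raise. The handlers below are standard codec error modes; the snippet is written in Python 3 as an illustration only.

```python
ch = "\u54a9"

replaced = ch.encode("ascii", errors="replace")
escaped = ch.encode("ascii", errors="backslashreplace")
dropped = ch.encode("ascii", errors="ignore")

print(replaced)   # b'?'
print(escaped)    # b'\\u54a9'
print(dropped)    # b''

try:
    ch.encode("ascii")            # default handler is "strict"
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print(exc.reason)             # ordinal not in range(128)
```

"replace" and "ignore" lose information, so "backslashreplace" is usually the safer choice for logs.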

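Reading command-line arguments

Command-line arguments obtained through sys.argv or argparse are encoded str objects, shown in \x hexadecimal form. sys.stdin.encoding tells you which encoding the command line used, so the arguments can be decoded into unicode.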

#! /usr/bin/env python
# coding=utf-8
import sys

print repr(sys.argv[1])
print sys.stdin.encoding
print repr(sys.argv[1].decode(sys.stdin.encoding))

Output:

~/workspace % python hello.py "哇嘿嘿"  
'\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf'
UTF-8
u'\u54c7\u563f\u563f'

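If the shell environment has been switched to another encoding such as GB2312 and Python cannot find a matching codec, sys.stdin.encoding falls back to ascii, and this approach can no longer decode the argument into unicode.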
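
One possible workaround, sketched here as an assumption rather than a recommendation from the original post, is to try a short list of candidate encodings in order. `decode_arg` is a hypothetical helper and the candidate list is only an example; the snippet uses Python 3 bytes to stand in for Python 2 str.

```python
def decode_arg(raw, encodings=("utf-8", "gb2312")):
    """Try each candidate encoding in turn; raise if none fits."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of %r could decode the argument" % (encodings,))


utf8_bytes = b"\xe5\x93\x87\xe5\x98\xbf\xe5\x98\xbf"   # the argument from the run above
gb_bytes = "\u554a\u54c8\u54c8".encode("gb2312")       # the same kind of input from a GB2312 shell

print(decode_arg(utf8_bytes) == "\u54c7\u563f\u563f")  # True
print(decode_arg(gb_bytes) == "\u554a\u54c8\u54c8")    # True
```

UTF-8 goes first because it is strict: bytes produced by another encoding rarely happen to be valid UTF-8, whereas many legacy codecs accept almost any input.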

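Converting strings containing \u escapes to unicode

Sometimes Chinese characters arrive written out as Unicode escapes, that is, each character appears in the form \u????.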

>>> a = u'咩'
>>> a
u'\u54a9'
>>> b = '\u54a9'
>>> b
'\\u54a9'

b above is such a case. It is a str of length 6, not a single Chinese character.

There are two ways to turn b into the actual character.

The unicode-escape codec.

 >>> unicode(b, 'unicode-escape')
 u'\u54a9'

or

 >>> b.decode('unicode-escape')
 u'\u54a9'

Concatenation plus eval.

 >>> eval('u"' + b.replace('"', r'\"')+'"')
 u'\u54a9'
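
The same conversion can be sketched in Python 3 (an assumption; the post targets Python 2). The second option below swaps eval for the JSON parser, which understands \u escapes and does not execute code; it is a substitute technique, not one of the two methods above, and it assumes the input contains no unescaped double quotes.

```python
import json

b = "\\u54a9"                 # six characters: \ u 5 4 a 9
print(len(b))                 # 6

# Option 1: the unicode-escape codec (the input must be ASCII-only).
via_codec = b.encode("ascii").decode("unicode-escape")

# Option 2: let the JSON parser interpret the escape; no eval() involved.
via_json = json.loads('"' + b + '"')

print(via_codec == "\u54a9")  # True
print(via_json == "\u54a9")   # True
```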

Web Page Main-Text Extraction and Similar-Article Comparison Algorithms

Author: forthxu
Date: August 18, 2016
Category: Default

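Two good algorithms for extracting the main text of a web page:

From China:
"General web page main-text extraction based on line-block distribution functions" (《基于行块分布函数的通用网页正文抽取》) from Harbin Institute of Technology. The algorithm is open source at http://code.google.com/p/cx-extractor/. The paper claims an accuracy above 95% and 21.29 seconds to process 1000 pages. The paper reads well, and since the method needs no HTML parsing it should be fairly efficient.

Html2Article (C#): http://www.cnblogs.com/jasondan/p/3497757.html

International:
Readability, from the well-known arc90 lab. The algorithm has been commercialized as Firefox and Chrome extensions and in Flipboard, and it has been integrated into the Safari browser. I have not tested it in detail, but rough testing suggests an accuracy of at least 90%. It has to parse the DOM tree, so it runs somewhat slower. Roughly: parse the DOM tree and lowercase all tags, remove the contents of every "script" tag, then extract the text with a set of cooperating regular expressions. I have not studied the algorithm itself yet. The extension ships with the JavaScript source.
Volunteers have ported it to C# and PHP; the sources are:

Official site: http://www.readability.com/
C# implementation 1: https://github.com/marek-stoj/NReadability
C# implementation 2: https://github.com/marek-stoj/NReadability
PHP implementation 1: https://bitbucket.org/fivefilters/php-readability
PHP implementation 2: https://github.com/feelinglucky/php-readability (author's page: http://www.gracecode.com/archives/3061/)
Node.js version: https://github.com/arrix/node-readability/

Source: http://www.cnblogs.com/phoenixnudt/articles/2382140.html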

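Similar-article algorithms

『simhash algorithm』

simhash is the algorithm Google uses to deduplicate text at massive scale. It comes from Google, so you know what to expect. Its best feature is that it reduces a document to a single 64-bit value, call it the fingerprint. Deciding whether two documents are similar then only requires checking whether the distance between their fingerprints (the number of differing bits, i.e. the Hamming distance) is < n, where experience suggests n = 3.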
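
A minimal sketch of the idea in Python 3. Everything specific here is an assumption made for illustration: tokens are plain whitespace-separated words, every token has weight 1, and MD5 is used as the per-token hash; production implementations typically use weighted features and a faster hash.

```python
import hashlib


def simhash(tokens, bits=64):
    """Fold the hashes of all tokens into one fingerprint of `bits` bits."""
    weights = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


def is_near_duplicate(tokens_a, tokens_b, n=3):
    """Apply the distance < n rule described above."""
    return hamming_distance(simhash(tokens_a), simhash(tokens_b)) < n


doc = "the quick brown fox jumps over the lazy dog".split()
print(hamming_distance(simhash(doc), simhash(doc)))   # 0
print(is_near_duplicate(doc, doc))                    # True
```

Because each bit is decided by a majority vote over all tokens, changing a few tokens flips only a few bits, which is what makes the small distance threshold meaningful.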

Projects implementing simhash

C++ version: simhash
Golang version: gosimhash

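『Baidu's deduplication algorithm』

Baidu's algorithm is the simplest: take the n longest sentences of the article and compute a hash signature over them; n is usually 3. It is extremely easy to implement, and precision and recall are both said to exceed 80%.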
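
The description above is brief, so the following Python 3 sketch fills in the details with assumptions of its own: sentences are split on common Chinese and Western end punctuation, ties keep their original order, and MD5 serves as the signature hash. `longest_sentence_signature` is a hypothetical name.

```python
import hashlib
import re


def longest_sentence_signature(text, n=3):
    """MD5 over the n longest sentences of the text."""
    sentences = [s.strip() for s in re.split(r"[。！？!?.\n]+", text) if s.strip()]
    # sorted() is stable, so sentences of equal length keep their order.
    longest = sorted(sentences, key=len, reverse=True)[:n]
    return hashlib.md5("\n".join(longest).encode("utf-8")).hexdigest()


original = "aaaa aaaa aaaa. bbbb bbbb bbb. cccc cccc cc. hi."
reposted = original + " Extra."        # a short sentence appended by a reposter

print(longest_sentence_signature(original) == longest_sentence_signature(reposted))   # True
```

Short additions such as a source credit do not change the signature, while editing any of the longest sentences does.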

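『shingle algorithm』

The principle behind shingling is somewhat involved, so I will not go into it. In my view it is too academic and unfriendly to engineering use: it is too slow and basically cannot handle massive data.

『Other algorithms』
See the discussion on Weibo for details.

Source: https://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html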

Recovering After Accidentally Deleting libc.so.6

Author: forthxu
Date: July 12, 2016
Category: Default

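Testing Selenium's chromedriver required libc.so 2.15 or later, so I compiled and installed it myself. After deleting the old symlink, and before I could recreate it, ls, dir and a whole series of other commands stopped working and failed with: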

rm: error while loading shared libraries: libc.so.6: cannot open shared object file: No such file or directory

Only then did I realize I had touched a core system shared library. The fix:

LD_PRELOAD=/lib64/libc-2.12.so ln -sf /lib64/libc-2.12.so /lib64/libc.so.6

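It works because LD_PRELOAD makes the dynamic loader load the specified library first, so ln can run even though libc.so.6 is missing.

Further reading:
Upgrading libstdc++.so.6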

https://github.com/FezVrasta/ark-server-tools/wiki/Install-of-required-versions-of-glibc-and-gcc-on-RHEL-CentOS

https://centos.pkgs.org/5/centos-x86_64/libstdc++-4.1.2-55.el5.x86_64.rpm.html

ftp://ftp.gwdg.de/pub/misc/gcc/releases/gcc-4.9.4/

http://www.mudbest.com/centos%E5%8D%87%E7%BA%A7gcc4-4-7%E5%8D%87%E7%BA%A7gcc4-8%E6%89%8B%E8%AE%B0/

CAPTCHA Recognition

Author: forthxu
Date: July 5, 2016
Category: Default

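CAPTCHA recognition roughly breaks down into these steps:

Step 1: fetch the CAPTCHA image.
Step 2: preprocess it. If the colors are simple and the background has no noise, binarize it directly, paying attention to the threshold. If there are interference lines, remove them along with the background, so that the result has a white background and black foreground characters.
Step 3: segment it, cutting each character out of the image.
Step 4: build a sample library. Store the segmented character images in the library by class; once the program has learned from a number of CAPTCHAs, the library holds examples.
Step 5: compare the characters of the CAPTCHA being recognized against the samples in the library to obtain the result.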
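
Steps 2 and 3 can be sketched without any imaging library. The Python 3 snippet below is a simplified illustration, not a port of the C# code in this post: the image is assumed to be a 2-D list of gray values (0 to 255), the threshold of 128 is arbitrary, and characters are assumed to be separated by at least one empty column.

```python
def binarize(gray, threshold=128):
    """Map a 2-D list of 0-255 gray values to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def split_columns(binary, min_width=1):
    """Cut wherever a column holds no ink; return (start, end) column pairs."""
    width = len(binary[0])
    ink_per_column = [sum(row[x] for row in binary) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x                          # a glyph begins
        elif not ink and start is not None:
            if x - start >= min_width:
                segments.append((start, x - 1))
            start = None                       # a glyph ends
    if start is not None:                      # glyph touching the right edge
        segments.append((start, width - 1))
    return segments


gray = [
    [255,  10, 255, 255,  20,  30],
    [255,  15, 255, 255,  25, 255],
]
binary = binarize(gray)
print(binary)                  # [[0, 1, 0, 0, 1, 1], [0, 1, 0, 0, 1, 0]]
print(split_columns(binary))   # [(1, 1), (4, 5)]
```

Characters that touch each other defeat this simple projection, which is why the C# version also enforces minimum and maximum character widths.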

Example code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Drawing;
using System.Net;
using System.Collections;
using System.IO;

namespace CheckCodeRecognizeLib
{
    public class CheckCodeRecognize
    {
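        #region Fields
        // Color-difference threshold: the smaller it is, the more noise is removed
        private double threshold = 150;
        // Binarization threshold: the larger it is, the less pronounced the effect
        private double ezFZ = 0.6;
        // Background-similarity threshold
        private double bjfz = 80;
        // Image path
        private string imgPath = string.Empty;
        // Minimum width of a character
        public int MinWidthPerChar = 7;
        // Maximum width of a character
        public int MaxWidthPerChar = 18;
        // Minimum height of a character
        public int MinHeightPerChar = 10;
        // Directory where the learned samples are stored
        private readonly string samplePath = AppDomain.CurrentDomain.BaseDirectory + "Sample\\";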
        #endregion

        #region Image processing
        /// <summary>
        /// Binarizes the given image.
        /// </summary>
        /// <param name="bitmap">The original image.</param>
        /// <returns>The processed image.</returns>
        private Bitmap EZH(Bitmap bitmap)
        {
            if (bitmap != null)
            {
                var img = new Bitmap(bitmap);
                for (var x = 0; x < img.Width; x++)
                {
                    for (var y = 0; y < img.Height; y++)
                    {
                        Color color = img.GetPixel(x, y);
                        if (color.GetBrightness() < ezFZ)
                        {
                            img.SetPixel(x, y, Color.Black);
                        }
                        else
                        {
                            img.SetPixel(x, y, Color.White);
                        }
                    }
                }
                return img;                
            }
            return null;
            
        }
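        /// <summary>
        /// Removes the background.
        /// The most frequent color in the image is treated as the background and replaced with white.
        /// </summary>
        /// <param name="bitmapImg">The image to process.</param>
        /// <returns>The image with its background removed.</returns>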
        private Bitmap RemoveBackGround(Bitmap bitmapImg)
        {
            if (bitmapImg == null)
            {
                return null;
            }          
            // key: color, value: number of pixels with that color
            Dictionary<Color, int> colorDic = new Dictionary<Color, int>();
            // Count how many pixels each color has
            for (var x = 0; x < bitmapImg.Width; x++)
            {
                for (var y = 0; y < bitmapImg.Height; y++)
                {
                    // Erase the top and bottom border rows
                    if (y == 0 || y == bitmapImg.Height - 1)
                    {
                        bitmapImg.SetPixel(x, y, Color.White);
                    }

                    var color = bitmapImg.GetPixel(x, y);

                    if (colorDic.ContainsKey(color))
                    {
                        colorDic[color] = colorDic[color] + 1;
                    }
                    else
                    {
                        colorDic[color] = 1;
                    }
                }
            }
            // Most frequent color in the image
            Color maxColor = colorDic.OrderByDescending(o => o.Value).FirstOrDefault().Key;
            // Least frequent color in the image
            Color minColor = colorDic.OrderBy(o => o.Value).FirstOrDefault().Key;

            Dictionary<int[], double> maxColorDifDic = new Dictionary<int[], double>();
            // Compute the distance of every pixel from maxColor
            for (var x = 0; x < bitmapImg.Width; x++)
            {
                for (var y = 0; y < bitmapImg.Height; y++)
                {
                    maxColorDifDic.Add(new int[] { x, y }, GetColorDif(bitmapImg.GetPixel(x, y), maxColor));
                }
            }
            // Replace colors close to maxColor with white
            var maxColorDifList = maxColorDifDic.OrderBy(o => o.Value).Where(o => o.Value < bjfz).ToArray();
            foreach (var kv in maxColorDifList)
            {
                bitmapImg.SetPixel(kv.Key[0], kv.Key[1], Color.White);
            }
            return bitmapImg;
           
        }
        /// <summary>
        /// Gets the color difference (Euclidean distance between the RGB components).
        /// </summary>
        /// <param name="color1"></param>
        /// <param name="color2"></param>
        /// <returns></returns>
        private double GetColorDif(Color color1, Color color2)
        {
            return Math.Sqrt((Math.Pow((color1.R - color2.R), 2) +
                Math.Pow((color1.G - color2.G), 2) +
                Math.Pow((color1.B - color2.B), 2)));
        }
        /// <summary>
        /// Removes interference lines from the image.
        /// </summary>
        /// <param name="img">The image to process.</param>
        /// <returns>The image with interference lines removed.</returns>
        private Bitmap btnDropDisturb_Click(Bitmap img)
        {
            if (img == null)
            {
                return null;
            }         
            byte[] p = new byte[9]; // smallest processing window: 3*3
            // Remove interference lines
            for (var x = 0; x < img.Width; x++)
            {
                for (var y = 0; y < img.Height; y++)
                {
                    Color currentColor = img.GetPixel(x, y);
                    int color = currentColor.ToArgb();

                    if (x > 0 && y > 0 && x < img.Width - 1 && y < img.Height - 1)
                    {
                        #region Median filter (poor results)
                        //// Take the values of the 9 points
                        //p[0] = img.GetPixel(x - 1, y - 1).R;
                        //p[1] = img.GetPixel(x, y - 1).R;
                        //p[2] = img.GetPixel(x + 1, y - 1).R;
                        //p[3] = img.GetPixel(x - 1, y).R;
                        //p[4] = img.GetPixel(x, y).R;
                        //p[5] = img.GetPixel(x + 1, y).R;
                        //p[6] = img.GetPixel(x - 1, y + 1).R;
                        //p[7] = img.GetPixel(x, y + 1).R;
                        //p[8] = img.GetPixel(x + 1, y + 1).R;
                        //// Compute the median
                        //for (int j = 0; j < 5; j++)
                        //{
                        //    for (int i = j + 1; i < 9; i++)
                        //    {
                        //        if (p[j] > p[i])
                        //        {
                        //            s = p[j];
                        //            p[j] = p[i];
                        //            p[i] = s;
                        //        }
                        //    }
                        //}
                        ////      if (img.GetPixel(x, y).R < dgGrayValue)
                        //img.SetPixel(x, y, Color.FromArgb(p[4], p[4], p[4]));    // assign the median to the pixel
                        #endregion

                        // up: x, y+1
                        double upDif = GetColorDif(currentColor, img.GetPixel(x, y + 1));
                        // down: x, y-1
                        double downDif = GetColorDif(currentColor, img.GetPixel(x, y - 1));
                        // left: x-1, y
                        double leftDif = GetColorDif(currentColor, img.GetPixel(x - 1, y));
                        // right: x+1, y
                        double rightDif = GetColorDif(currentColor, img.GetPixel(x + 1, y));
                        // upper left
                        double upLeftDif = GetColorDif(currentColor, img.GetPixel(x - 1, y + 1));
                        // upper right
                        double upRightDif = GetColorDif(currentColor, img.GetPixel(x + 1, y + 1));
                        // lower left
                        double downLeftDif = GetColorDif(currentColor, img.GetPixel(x - 1, y - 1));
                        // lower right
                        double downRightDif = GetColorDif(currentColor, img.GetPixel(x + 1, y - 1));

                        //// Large color difference on all four sides
                        //if (upDif > threshold && downDif > threshold && leftDif > threshold && rightDif > threshold)
                        //{
                        //    img.SetPixel(x, y, Color.White);
                        //}
                        // Large color difference on three sides
                        if ((upDif > threshold && downDif > threshold && leftDif > threshold)
                            || (downDif > threshold && leftDif > threshold && rightDif > threshold)
                            || (upDif > threshold && leftDif > threshold && rightDif > threshold)
                            || (upDif > threshold && downDif > threshold && rightDif > threshold))
                        {
                            img.SetPixel(x, y, Color.White);
                        }

                        List<int[]> xLine = new List<int[]>();
                        // Remove horizontal interference lines: a run of points with white pixels directly above and below is treated as noise
                        for (var x1 = x + 1; x1 < x + 10; x1++)
                        {
                            if (x1 >= img.Width)
                            {
                                break;
                            }

                            if (img.GetPixel(x1, y + 1).ToArgb() == Color.White.ToArgb()
                                && img.GetPixel(x1, y - 1).ToArgb() == Color.White.ToArgb())
                            {
                                xLine.Add(new int[] { x1, y });
                            }
                        }
                        if (xLine.Count() >= 4)
                        {
                            foreach (var xpoint in xLine)
                            {
                                img.SetPixel(xpoint[0], xpoint[1], Color.White);
                            }
                        }

                        // Remove vertical interference lines (not implemented)

                    }
                }
            }
            return img;
        }
        /// <summary>
        /// Splits the image vertically first, then horizontally.
        /// </summary>
        /// <param name="img">The image to split.</param>
        /// <returns>All character images produced by the split.</returns>
        private Bitmap[] SplitImage(Bitmap img)
        {
            if (img == null)
            {
                return null;
            }
            List<int[]> xCutPointList = GetXCutPointList(img);
            List<int[]> yCutPointList = GetYCutPointList(xCutPointList, img);       
            Bitmap[] bitmapArr = new Bitmap[5];
            // Crop each segmented region
            for (int i = 0; i < xCutPointList.Count(); i++)
            {
                int xStart = xCutPointList[i][0];
                int xEnd = xCutPointList[i][1];
                int yStart = yCutPointList[i][0];
                int yEnd = yCutPointList[i][1];
                if (i >= 4) break;
                bitmapArr[i]= (Bitmap)AcquireRectangleImage(img,
                    new Rectangle(xStart, yStart, xEnd - xStart + 1, yEnd - yStart + 1));
            }
            return bitmapArr;
        }
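        /// <summary>
        /// Scans from the top and from the bottom of the image for the rows where the black-pixel count exceeds the threshold, yielding the effective region that contains black pixels.
        /// </summary>
        /// <param name="xCutPointList">The x ranges produced by the vertical cuts.</param>
        /// <param name="img">The target image.</param>
        /// <returns>Start and end y coordinates, i.e. the effective region of black pixels.</returns>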
        private List<int[]> GetYCutPointList(List<int[]> xCutPointList, Bitmap img)
        {
            List<int[]> list = new List<int[]>();
            // Topmost y of the glyph
            int topY = 0;
            // Bottommost y of the glyph
            int bottomY = 0;
            foreach (var xPoint in xCutPointList)
            {
                for (int ty = 1; ty < img.Height; ty++)
                {
                    int xStart = xPoint[0];
                    int xEnd = xPoint[1];
                    int blackCount = GetBlackPXCountInY(ty, 2, xStart, xEnd, img);
                    if (blackCount > 3)
                    {
                        topY = ty;
                        break;
                    }
                }
                for (int by = img.Height; by > 1; by--)
                {
                    int xStart = xPoint[0];
                    int xEnd = xPoint[1];
                    int blackCount = GetBlackPXCountInY(by, -2, xStart, xEnd, img);
                    if (blackCount > 3)
                    {
                        bottomY = by;
                        break;
                    }
                }
                list.Add(new int[] { topY, bottomY });

            }
            return list;
        }
        /// <summary>
        /// Counts the black pixels in a region after splitting.
        /// </summary>
        /// <param name="startY"></param>
        /// <param name="offset"></param>
        /// <param name="startX"></param>
        /// <param name="endX"></param>
        /// <param name="img"></param>
        /// <returns></returns>
        private int GetBlackPXCountInY(int startY, int offset, int startX, int endX, Bitmap img)
        {
            int blackPXCount = 0;
            int startY1 = offset > 0 ? startY : startY + offset;
            int offset1 = offset > 0 ? startY + offset : startY;
            for (var x = startX; x <= endX; x++)
            {
                for (var y = startY1; y < offset1; y++)
                {
                    if (y >= img.Height)
                    {
                        continue;
                    }
                    if (img.GetPixel(x, y).ToArgb() == Color.Black.ToArgb())
                    {
                        blackPXCount++;
                    }
                }
            }
            return blackPXCount;
        }
        /// <summary>
        /// Counts the black pixels in a vertical strip.
        /// </summary>
        /// <param name="startX">Starting x.</param>
        /// <param name="offset">Width of the strip in pixels, counted rightward from startX.</param>
        /// <returns></returns>
        private int GetBlackPXCountInX(int startX, int offset, Bitmap img)
        {
            int blackPXCount = 0;
            for (int x = startX; x < startX + offset; x++)
            {
                if (x >= img.Width)
                {
                    continue;
                }
                for (var y = 0; y < img.Height; y++)
                {
                    if (img.GetPixel(x, y).ToArgb() == Color.Black.ToArgb())
                    {
                        blackPXCount++;
                    }
                }
            }
            return blackPXCount;
        }
        /// <summary>
        /// Gets the vertical cut points.
        /// </summary>
        /// <param name="img"></param>
        /// <returns>List int[xstart xend]</returns>
        private List<int[]> GetXCutPointList(Bitmap img)
        {
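            // Cut points: List<int[xstart xend]>
            List<int[]> xCutList = new List<int[]>();
            int startX = -1;// -1 means we are looking for a start point
            for (var x = 0; x < img.Width; x++)
            {
                if (startX == -1)// looking for the start point
                {
                    int blackPXCount = GetBlackPXCountInX(x, 2, img);
                    // A strip with more than 5 black pixels is taken as the start of a character; this keeps stray noise from being cut out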
                    if (blackPXCount > 5)
                    {
                        startX = x;

                    }
                }
                else// looking for the end point
                {
                    if (x == img.Width - 1)// is this the last column?
                    {
                        xCutList.Add(new int[] { startX, x });
                        break;
                    }
                    else if (x >= startX + MinWidthPerChar)// a cut may only end after a minimum distance
                    {
                        int blackPXCount = GetBlackPXCountInX(x, 2, img);// black pixels in the strip that follows
                        // Fewer than 2 black pixels marks the end point
                        if (blackPXCount < 2)
                        {

                            if (x > startX + MaxWidthPerChar)// ideally this branch never runs
                            {
                                // Wider than the maximum character width: two characters are probably stuck together, so split in the middle
                                int middleX = startX + (x - startX) / 2;
                                xCutList.Add(new int[] { startX, middleX });
                                xCutList.Add(new int[] { middleX + 1, x });
                            }
                            else
                            {
                                // Check whether there are too few black pixels
                                blackPXCount = GetBlackPXCountInX(startX, x - startX, img);
                                if (blackPXCount <= 10)
                                {
                                    startX = -1;// reset the start point
                                }
                                else
                                {
                                    xCutList.Add(new int[] { startX, x });
                                }
                            }
                            startX = -1;// reset the start point
                        }
                    }
                }
            }
            return xCutList;
        }
        /// <summary>
        /// Crops a rectangular region out of an image.
        /// </summary>
        /// <param name="source">The source image.</param>
        /// <param name="rect">The rectangle to crop.</param>
        /// <returns>The image of the rectangular region.</returns>
        private Image AcquireRectangleImage(Image source, Rectangle rect)
        {
            if (source == null || rect.IsEmpty) return null;
            //Bitmap bmSmall = new Bitmap(rect.Width, rect.Height, System.Drawing.Imaging.PixelFormat.Format32bppRgb);
            Bitmap bmSmall = new Bitmap(rect.Width, rect.Height, source.PixelFormat);

            using (Graphics grSmall = Graphics.FromImage(bmSmall))
            {
                grSmall.DrawImage(source,
                                  new System.Drawing.Rectangle(0, 0, bmSmall.Width, bmSmall.Height),
                                  rect,
                                  GraphicsUnit.Pixel);
            }
            return bmSmall;
        }
        #endregion

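        #region Image recognition
        /// <summary>
        /// Returns the similarity of two images; the maximum is 1.
        /// </summary>
        /// <param name="compareImg">The sample image to compare against.</param>
        /// <param name="mainImg">The image to recognize.</param>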
        /// <returns></returns>
        private double CompareImg(Bitmap compareImg, Bitmap mainImg)
        {
            int img1x = compareImg.Width;
            int img1y = compareImg.Height;
            int img2x = mainImg.Width;
            int img2y = mainImg.Height;
            // Smaller of the two widths
            double min_x = img1x > img2x ? img2x : img1x;
            // Smaller of the two heights
            double min_y = img1y > img2y ? img2y : img1y;

            double score = 0;
            // Count the overlapping black pixels
            for (var x = 0; x < min_x; x++)
            {
                for (var y = 0; y < min_y; y++)
                {
                    if (compareImg.GetPixel(x, y).ToArgb() == Color.Black.ToArgb()
                        && compareImg.GetPixel(x, y).ToArgb() == mainImg.GetPixel(x, y).ToArgb())
                    {
                        score++;
                    }
                }
            }
            double originalBlackCount = 0;
            // Count the black pixels in the sample image
            for (var x = 0; x < img1x; x++)
            {
                for (var y = 0; y < img1y; y++)
                {
                    if (Color.Black.ToArgb() == compareImg.GetPixel(x, y).ToArgb())
                    {
                        originalBlackCount++;
                    }
                }
            }
            return score / originalBlackCount;
        }
        /// <summary>
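        /// Compares the current image against every learned sample and picks, by the ratio of overlapping black pixels, the most similar character image.
        /// </summary>
        /// <param name="bitImg">The CAPTCHA image to recognize.</param>
        /// <returns>The recognized string.</returns>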
        public string RecognizeCheckCodeImg(Bitmap bitImg)
        {
            Bitmap EZHimg = EZH(bitImg);
            Bitmap[] imgArr = SplitImage(EZHimg);
            string returnString = string.Empty;
            for (int i = 0; i < imgArr.Length; i++)
            {
                var img = imgArr[i];
                if (img == null)
                {
                    continue;
                }
                string[] detailPathList = Directory.GetDirectories(samplePath);
                if (detailPathList == null || detailPathList.Length == 0)
                {
                    continue;
                }
                string resultString = string.Empty;
                // config.txt lists the characters to recognize, in order
                string configPath = samplePath + "config.txt";
                if (!File.Exists(configPath))
                {
                    Console.WriteLine("config.txt not found, cannot recognize");
                    return null;
                }
                string configString = File.ReadAllText(configPath);
                double maxRate = 0;// best similarity so far; the maximum is 1
                foreach (char resultChar in configString)
                {
                    string charPath = samplePath + resultChar.ToString();// directory holding this character's samples
                    if (!Directory.Exists(charPath))
                    {
                        continue;
                    }
                    string[] fileNameList = Directory.GetFiles(charPath);
                    if (fileNameList == null || fileNameList.Length == 0)
                    {
                        continue;
                    }


                    foreach (string filename in fileNameList)
                    {
                        // using releases the sample bitmap even when it is skipped
                        using (Bitmap imgSample = new Bitmap(filename))
                        {
                            // Skip samples whose width or height differs too much
                            if (Math.Abs(imgSample.Width - img.Width) >= 2
                                || Math.Abs(imgSample.Height - img.Height) >= 3)
                            {
                                continue;
                            }
                            // Similarity of this sample
                            double currentRate = CompareImg(imgSample, img);
                            if (currentRate > maxRate)
                            {
                                maxRate = currentRate;
                                resultString = resultChar.ToString();
                            }
                        }
                    }
                }
                returnString = returnString + resultString;
            }
            return returnString;
        }
        #endregion

    }
}
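
The matching rule used by CompareImg and RecognizeCheckCodeImg (overlapping black pixels divided by the sample's black pixels, best score wins) can be condensed into a few lines. This Python 3 sketch is an illustration only: glyphs are assumed to be 2-D lists of 0/1 values, and the tiny sample library is made up for the example.

```python
def similarity(sample, target):
    """Share of the sample's ink pixels that are also ink in the target."""
    rows = min(len(sample), len(target))
    cols = min(len(sample[0]), len(target[0]))
    overlap = sum(1 for y in range(rows) for x in range(cols)
                  if sample[y][x] and target[y][x])
    sample_ink = sum(sum(row) for row in sample)
    return overlap / sample_ink if sample_ink else 0.0


def recognize(target, samples):
    """Return the label of the best-scoring sample, or '' if nothing matches."""
    best_label, best_score = "", 0.0
    for label, sample in samples:
        score = similarity(sample, target)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


samples = [
    ("I", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("L", [[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
]
target = [[1, 0, 0], [1, 0, 0], [1, 1, 0]]   # an "L" with one pixel missing
print(recognize(target, samples))            # L
```

Because the score is normalized by the sample's ink only, a target with a lot of extra ink can still score 1.0 against a sparse sample; the C# code limits this by first rejecting samples whose size differs too much.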

Test program:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using System.Drawing;
using System.IO;

namespace CheckCodeRecognizeLibTest
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Downloading the CAPTCHA......");
            //CookieContainer cc = new CookieContainer();
            //byte[] imgByte = HttpWebRequestForBPMS.GetWebResorce("http://hd.cnrds.net/hd/login.do?action=createrandimg",cc);
            //MemoryStream ms1 = new MemoryStream(imgByte);
            //Bitmap bm = (Bitmap)Image.FromStream(ms1);
            Bitmap img = HttpWebRequestForBPMS.GetWebImage("http://hd.cnrds.net/hd/login.do?action=createrandimg");
            Console.WriteLine("CAPTCHA downloaded, recognizing.....");
            CheckCodeRecognizeLib.CheckCodeRecognize regImg = new CheckCodeRecognizeLib.CheckCodeRecognize();
            string regResult= regImg.RecognizeCheckCodeImg(img);
            Console.WriteLine("CAPTCHA recognized, result: "+regResult);
            
          
        }       
    }
  
    
}

Source: http://www.cnblogs.com/fuchongjundream/p/5403193.html

Scraping WeChat Official Account Articles via Sogou

Author: forthxu
Date: June 30, 2016
Category: Default

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# https://github.com/forthxu/WechatSearchProjects also contains a Scrapy version with the same functionality

import sys
import re
import urllib, urllib2
import requests
import pymongo
import datetime
from bs4 import BeautifulSoup
import multiprocessing as mp


class MongoDBIO:
    # Declare the connection attributes
    def __init__(self, host, port, name, password, database, collection):
        self.host = host
        self.port = port
        self.name = name
        self.password = password
        self.database = database
        self.collection = collection

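    # Connect to the database; db and posts are handles to the database and the collection
    def Connection(self):
        # connection = pymongo.MongoClient() # connect to the local database
        # MongoClient replaces pymongo.Connection, which pymongo 3 removed
        connection = pymongo.MongoClient(host=self.host, port=self.port)
        # db = connection.datas
        db = connection[self.database]
        if self.name or self.password:
            db.authenticate(name=self.name, password=self.password) # verify user name and password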
        # print "Database:", db.name
        # posts = db.cn_live_news
        posts = db[self.collection]
        # print "Collection:", posts.name
        return posts

# # Save operation
# def ResultSave(save_host, save_port, save_name, save_password, save_database, save_collection, save_contents):
#     posts = MongoDBIO(save_host, save_port, save_name, save_password, save_database, save_collection).Connection()
#
#     for save_content in save_contents:
#         posts.save(save_content)
# Save operation
def ResultSave(save_host, save_port, save_name, save_password, save_database, save_collection, save_content):
    posts = MongoDBIO(save_host, save_port, save_name, save_password, save_database, save_collection).Connection()
    posts.save(save_content)


def GetTitleUrl(url, data):
    content = requests.get(url=url, params=data).content # send the GET request
    soup = BeautifulSoup(content, "html.parser") # name the parser explicitly to avoid the bs4 warning
    tags = soup.findAll("h4")
    titleurl = []
    for tag in tags:
        item = {"title":tag.text.strip(), "link":tag.find("a").get("href"), "content":""}
        titleurl.append(item)
    return titleurl

def GetContent(url):
    soup = BeautifulSoup(requests.get(url=url).content, "html.parser")
    tag = soup.find("div", attrs={"class":"rich_media_content", "id":"js_content"}) # take the first matching tag
    if tag is None: # the page layout changed or the request was blocked
        return ""
    content_list = [tag_i.text for tag_i in tag.findAll("p")]
    content = "".join(content_list)
    return content

def ContentSave(item):
    # Storage settings
    save_host = "localhost"
    save_port = 27017
    save_name = ""
    save_password = ""
    save_database = "testwechat"
    save_collection = "result"

    save_content = {
        "title":item["title"],
        "link":item["link"],
        "content":item["content"]
    }

    ResultSave(save_host, save_port, save_name, save_password, save_database, save_collection, save_content)

def func(tuple):
    querystring, type, page = tuple[0], tuple[1], tuple[2]
    url = "http://weixin.sogou.com/weixin"
    # GET parameters
    data = {
        "query":querystring,
        "type":type,
        "page":page
    }

    titleurl = GetTitleUrl(url, data)

    for item in titleurl:
        url = item["link"]
        print "url:", url
        content = GetContent(url)
        item["content"] = content
        ContentSave(item)


if __name__ == '__main__':
    start = datetime.datetime.now()

    querystring = u"清华" # search keyword ("Tsinghua")
    type = 2 # 2 = articles, 1 = official accounts

    # Multi-process crawl
    p = mp.Pool()
    p.map_async(func, [(querystring, type, page) for page in range(1, 50, 1)])
    p.close()
    p.join()

    # # Single-process crawl
    # for page in range(1, 50, 1):
    #     tuple = (querystring, type, page)
    #     func(tuple)

    end = datetime.datetime.now()
    print "elapsed time: ", end-start

Scraping the Articles of a Given Official Account from Chuansongmen (chuansong.me)

Author: forthxu
Date: June 28, 2016
Category: Default
Comments

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib2
import time
import csv
import sys,getopt,os
import pymysql

# Get the directory of the running script
def get_cur_file_dir():
    path = sys.path[0]
    if os.path.isdir(path):
        return path
    elif os.path.isfile(path):
        return os.path.dirname(path)

# Fetch a URL
def open_url(url):
    req = urllib2.Request(url)
    req.add_header('User-agent', 'Mozilla 5.10')
    # Try up to three times
    for i in range(0, 3):
        try:
            xhtml = urllib2.urlopen(req)
            return xhtml
        except urllib2.HTTPError,e:    # HTTPError must be caught before URLError (it is a subclass)
            print "The server couldn't fulfill the request"
            print "Error code:",e.code
            if e.code!=503:
                return False
            time.sleep(5)
            print("try again")
        except urllib2.URLError,e:
            print "Failed to reach the server"
            print "The reason:",e.reason
            # URLError has no .code attribute, so simply wait and retry
            time.sleep(5)
            print("try again")
    
    return False

# Process an article page
def down_content(content_url,path_url):
    xhtml=open_url(content_url)
    # Fetching the page failed
    if False == xhtml :
        return False

    # Parse the page
    soup = BeautifulSoup(xhtml, "html5lib")
    titleH2 = soup.find("h2", id="activity-name")
    if None == titleH2:
        return False
    title = titleH2.string.encode('utf-8')
    string_time = soup.find("em", id="post-date").string.encode('utf-8')
    num_time = int(time.mktime(time.strptime(string_time,'%Y-%m-%d')))
    keywords = str(soup.find(attrs={"name":"keywords"})['content'].encode('utf8','ignore'))
    description = str(soup.find(attrs={"name":"description"})['content'].encode('utf8','ignore'))
    content = soup.find_all("div", class_="rich_media_content")
    
    if len(content) < 1 :
        print("      "+"no content")
        return False
    
    # Write the article out as an HTML file
    html = """
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>"""+title+"""</title>
<meta name="keywords" content=\""""+keywords+"""\">
<meta name="description" content=\""""+description+"""\">
</head>
<body>
    <div id="body">
    <h1>"""+title+"""</h1>
    <div id="string_time">"""+string_time+""" </div><div id="num_time">"""+str(num_time)+"""</div>
    <div id="content">
    """+str(content[0])+"""
    </div>
    </div>
</body>
<script type="text/javascript" src="js/reimg.js"></script>
</html>
    """
        
    f=file(path_url,"w+")
    f.write(html)
    f.close()
    
    # Write to the database
    cur.execute("INSERT INTO archive (category,category_parents,title,summary,addtime,uptime) VALUES (27,\"0,12,27,\",%s,%s,%s,%s)",(title.strip(),description.strip(),num_time,num_time))
    #print cur.description
    #print "ID of last record is ", int(cur.lastrowid) # primary key ID of the last inserted row
    #print "ID of inserted record is ", int(conn.insert_id()) # primary key ID of the newest inserted row; conn.insert_id() must be called before conn.commit(), otherwise it returns 0
    lastid = int(cur.lastrowid)
    
    cur.execute("INSERT INTO archive_article (archive,intro,content) VALUES (%s,'',%s)",(lastid, str(content[0])))
    
    cur.connection.commit()
    
    return True

# Process a list page
def down_list(list_url):
    # Fetch the list page
    xhtml=open_url(list_url)
    if False == xhtml :
        return False

    # Collect the article links
    soup = BeautifulSoup(xhtml, "html5lib")
    title = soup.title.string.encode('utf-8')
    li_a = soup.find_all("a", class_="question_link")
    next_list = soup.find_all("a", text="下一页") # the site's "next page" link text
    
    # Log file
    writer = csv.writer(file(datapath+'list.csv', 'a+b'))
    x = 0
    y = 0
    # Loop over the article pages
    print(list_url+" start")
    for i in range(0, len(li_a)):
        content_id = li_a[i]['href'].encode('utf-8')[3:]
        content_title = li_a[i].string.encode('utf-8')
        content_url = "http://chuansong.me"+li_a[i]['href'].encode('utf-8')
        path_url = datapath+content_id+".html"
        
        if not os.path.exists(path_url):
            # Fetching the article failed; move on to the next one
            if False == down_content(content_url,path_url) :
                print("  "+str(x)+content_url+" down fail")
                continue
                #return False
                
            print("  "+str(x)+content_url+" down end")
            # Log the article
            writer.writerow([content_id, content_title, content_url])
            # Pause periodically
            x=x+1
            if x%2 == 1 :
                time.sleep(3)
            time.sleep(1)
        else:
            print("  "+content_url+" exist")
            y=y+1
            # Stop crawling after three articles that already exist
            if y>2 :
                return False
    print(list_url+" end")
    
    # There is no next list page
    if len(next_list) < 1 :
        return False

    # print("next "+next_list[0]['href'].encode('utf-8')+"\n")
    return True
    
# Crawl the list pages
def get_list(wechart):
    start=0
    # Loop over the list pages
    while True:
        if start==0:
            url = 'http://chuansong.me/account/'+wechart
        else:
            url = 'http://chuansong.me/account/'+wechart+'?start='+str(start)
        
        # Stop when finished or after more than 2000 entries
        start+=12
        if False == down_list(url) or start>2000:
            break

        time.sleep(1)
        
    print("get_list end")

# Help
def usage():
    help = """
-d temp dir,default: """+get_cur_file_dir()+"""
-w wechart,default: xingdongpai77
-u mysql user,default: root
-p mysql pwd,default: 
-h,--help for help
"""
    print help
    
if __name__ == "__main__":
    opts, args = getopt.getopt(sys.argv[1:], "d:w:u:p:h", ["help"])
    arg_dir = get_cur_file_dir()
    arg_wechart = 'xingdongpai77'
    arg_user = 'root'
    arg_pwd = ''
    for op, value in opts:
        if op == "-d":
            arg_dir = value
        elif op == "-w":
            arg_wechart = value
        elif op == "-u":
            arg_user = value
        elif op == "-p":
            arg_pwd = value
        elif op == "-h" or op == "--help":
            usage()
            sys.exit()

    print time.strftime("%Y-%m-%d %H:%M:%S")

    # Set up the temporary data directory
    datapath = arg_dir+'/data/'
    if not os.path.exists(datapath):
        os.makedirs(datapath)

    # Set up the database connection
    try:
        conn = pymysql.connect(host='127.0.0.1', port=3306, user=arg_user, passwd=arg_pwd, db='mysql')
        cur = conn.cursor()
        cur.execute("SET NAMES utf8")
        cur.execute("USE x")
    except pymysql.Error, e:
        print __file__, e
        usage()
        sys.exit()

    # Start crawling
    get_list(arg_wechart)
    
    # Close the database connection
    cur.close()
    conn.close()
    
    # xtime = time.strftime("%Y-%m-%d %H:%M:%S")
    # xday = time.strftime("%Y-%m-%d")
    # f=file(datapath+xtime+".html","w+")
    # f.write(body)
    # f.close()

Taking Screenshots of and Scraping Web Pages with PhantomJS

Author: forthxu
Date: June 28, 2016
Category: Default
Comments

Taking a screenshot of a web page with PhantomJS

[root@vps3 work]# wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2
[root@vps3 work]# tar jxvf phantomjs-2.1.1-linux-x86_64.tar.bz2
[root@vps3 work]# vim screenshots.js

var page = require('webpage').create();
var args = require('system').args;

var url = args[1];
var filename = args[2];

page.open(url, function(status) {
    console.log("Status: " + status);
    if(status === "success") {
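        // Run JavaScript inside the page
        var title = page.evaluate(function(){
            // Scroll down to trigger lazy-loaded images
            window.scrollTo(0,10000);
            // Return the page title
            return document.title;
        });
        // Debug output
        console.log('Page title is ' + title);
        
        // Wait so that images can load and scripts can run
        window.setTimeout(function ()
        {
            // Render the screenshot
            page.render(filename);
            // Exit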
            phantom.exit();
        }, 5000);
    }else{
        phantom.exit();
    }
});

Install the Microsoft YaHei font (if the screenshots contain no text)
[root@vps3 work]#yum -y install bitmap-fonts bitmap-fonts-cjk mkfontscale fontconfig
[root@vps3 work]#mkdir /usr/share/fonts/win/
[root@vps3 work]#wget https://nipao.googlecode.com/files/msyh.ttf -O /usr/share/fonts/win/msyh.ttf
[root@vps3 work]#mkfontscale
[root@vps3 work]#mkfontdir
[root@vps3 work]#fc-cache

Run the screenshot script
[root@vps3 work]#rm -rf /home/wwwroot/default/joke.png && phantomjs-2.1.1-linux-x86_64/bin/phantomjs screenshots.js http://joke.4399pk.com /home/wwwroot/default/joke.png
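
The command above can also be driven from a script. The following is a minimal Python 3 sketch, not part of the original post, that wraps the same invocation with the standard subprocess module. The function names and the 60-second timeout are illustrative assumptions; the binary, script, URL and output path are the ones used above.

```python
import subprocess


def build_screenshot_command(phantomjs_bin, script, url, output_png):
    """Return the argv list for: phantomjs screenshots.js <url> <output>."""
    return [phantomjs_bin, script, url, output_png]


def take_screenshot(phantomjs_bin, script, url, output_png, timeout=60):
    """Run PhantomJS and return True if it exited with status 0.

    screenshots.js waits 5 seconds before rendering, so the timeout
    (an assumed value) has to be comfortably longer than that.
    """
    cmd = build_screenshot_command(phantomjs_bin, script, url, output_png)
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, timeout=timeout)
    return result.returncode == 0


if __name__ == "__main__":
    ok = take_screenshot(
        "phantomjs-2.1.1-linux-x86_64/bin/phantomjs",
        "screenshots.js",
        "http://joke.4399pk.com",
        "/home/wwwroot/default/joke.png",
    )
    print("screenshot written" if ok else "phantomjs failed")
```

Passing the arguments as a list avoids shell quoting problems when the URL contains characters such as & or ?.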

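Note: in my tests, one of my VPS machines saved a transparent image with no content, while the other machines worked normally. The cause is unknown.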

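Collecting information with Selenium driving PhantomJS

Selenium began as an automated testing tool that can drive a variety of browsers. Between the official and third-party drivers many browsers are supported: besides desktop browsers, there are also drivers for iPhone and Android. Driving a browser through Selenium can also be used for scraping, to obtain the DOM as it stands after JavaScript has run (Ajax) or content such as lazily loaded images.

Selenium supports the headless browser PhantomJS. PhantomJS is not a conventional browser and has no GUI, but it is a browser-like program that can parse HTML and run JavaScript. It does not display the page on screen, yet it supports locating page elements, executing JavaScript, and so on.

The PhantomJS driver is installed separately; it is a complete, standalone program.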

[root@vps3 work]#pip install selenium
[root@vps3 work]#vim collect.py

# -*- coding: utf-8 -*-
from selenium import webdriver
import time
 
def capture(url, save_fn="capture.png"):
    browser = webdriver.PhantomJS(executable_path=r'/workspace/work/phantomjs-2.1.1-linux-x86_64/bin/phantomjs')
    browser.get(url)  
    ele = browser.find_element_by_id('weixin-account-btn')  
    print ele.get_attribute('style')  
    browser.quit()

if __name__ == "__main__":
    capture("http://joke.4399pk.com/")

[root@vps3 work]#python collect.py

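Note: if you use PhantomJS as the Selenium browser, Selenium's PhantomJS driver only supports Python 2.7.

* The latest way to take screenshots with Node.js
* Driving a headless browser from Go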

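The Memcache Protocol, Chinese Edition

Author: forthxu
Date: June 7, 2016
Category: Default
Comments

Preface
I happened to come across a side-by-side Chinese and English version of this document and found it inconvenient to read, so I spent half an hour carefully compiling a standalone Chinese version and recording it here.

Protocol
memcached clients communicate with the server over TCP connections. (A UDP interface is also available; see "UDP protocol" later in the text.) A running memcached server listens on some (configurable) port. Clients connect to that port, send commands to the server, read the responses, and eventually close the connection.

No command needs to be sent to end a session. A client may simply close the connection at any time once it no longer needs the memcached service. Note, however, that clients are encouraged to cache their connections rather than reopen them every time they need to store or retrieve data. This is because memcached is deliberately designed to work efficiently even with a very large number of open connections (hundreds, or thousands if necessary). Caching connections eliminates the overhead of establishing a TCP connection (by comparison, the overhead of preparing for a new connection on the server side is negligible).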
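
To make the connect, send, read, reuse cycle concrete, here is a minimal Python 3 sketch. It is not part of the protocol text. The set and get framing comes from the storage and retrieval commands of the memcached text protocol, which are specified later in the full document rather than in this excerpt; the class and function names, and the single recv() call per reply, are simplifying assumptions.

```python
import socket


def build_set(key, value, flags=0, exptime=0):
    """Frame a text-protocol 'set': header line, data block, trailing CRLF."""
    data = value.encode("utf-8")
    header = "set {} {} {} {}\r\n".format(key, flags, exptime, len(data))
    return header.encode("ascii") + data + b"\r\n"


def build_get(key):
    """Frame a text-protocol 'get' for a single key."""
    return "get {}\r\n".format(key).encode("ascii")


def parse_get_reply(reply):
    """Return the value bytes from a single-key get reply, or None on a miss."""
    if reply == b"END\r\n":
        return None
    header, rest = reply.split(b"\r\n", 1)
    length = int(header.split(b" ")[3])  # VALUE <key> <flags> <bytes>
    return rest[:length]


class CachedConnection:
    """Keeps one TCP connection open and reuses it for every command."""

    def __init__(self, host="127.0.0.1", port=11211):
        self.sock = socket.create_connection((host, port))

    def request(self, payload):
        # Simplification: assumes the whole reply arrives in one recv().
        self.sock.sendall(payload)
        return self.sock.recv(4096)

    def close(self):
        # No goodbye command is needed; closing the socket ends the session.
        self.sock.close()
```

With a server listening on the default port, a session would create one CachedConnection, call request(build_set("greeting", "hello")) and request(build_get("greeting")) on it, and close it only when the service is no longer needed.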


Forward Proxies and Reverse Proxies: Differences and Uses

Author: forthxu
Date: June 6, 2016
Category: Default
Comments

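Role in the request	Client	Proxy server	Content server
Forward proxy	Must be configured to use the proxy; requests the content server's resources through the proxy server	Requests the content server on behalf of the client	Sees only the requests coming from the proxy server and responds to them
Reverse proxy	Needs no configuration; requests resources from the proxy server	Holds no resources itself; fetches them from the content server and returns them to the client	Sees only the requests coming from the proxy server and responds to them
Use cases
Forward proxy	Typically gives LAN clients behind a firewall a way to reach the Internet.
	Can also use caching to reduce network usage.
Reverse proxy	Typically makes servers behind a firewall accessible to Internet users.
	Can also load-balance across several backend servers, or cache on behalf of slower backend servers.
	Can also apply advanced URL policies and management techniques, so that web pages living on different web server systems appear in the same URL space.
Security
Forward proxy	Lets clients reach arbitrary sites through it while hiding the clients themselves, so you must take security measures to ensure it serves only authorized clients.
Reverse proxy	Is transparent to the outside world: visitors do not know that they are talking to a proxy.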
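
The "Client" column of the table can be illustrated with Python 3's standard urllib. This is a sketch under assumptions, not something from the original post: the helper names are made up for the example, and any proxy address passed in (such as http://127.0.0.1:3128) is a placeholder.

```python
import urllib.request


def forward_proxy_opener(proxy_url):
    """Forward proxy: the client has to be told explicitly where the proxy is."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def direct_opener():
    """Reverse proxy: the client needs no proxy settings at all.

    Passing an empty mapping also stops urllib from picking up
    proxy settings from environment variables.
    """
    return urllib.request.build_opener(urllib.request.ProxyHandler({}))


def configured_proxies(opener):
    """Return the proxy mapping an opener will use, or {} if it has none."""
    for handler in opener.handlers:
        if isinstance(handler, urllib.request.ProxyHandler):
            return dict(handler.proxies)
    return {}
```

With the first opener every request is routed through the configured proxy. With the second, the client simply opens the public URL; whether a reverse proxy sits behind that address is invisible to it.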

Batch-Updating the Same Field of Multiple MySQL Rows to Different Values

Author: forthxu
Date: May 28, 2016
Category: Default
1 comment

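To update a field in MySQL you would normally write:

UPDATE mytable SET myfield = 'value' WHERE other_field = 'other_value';

You can also use IN to specify the rows to update:

UPDATE mytable SET myfield = 'value' WHERE other_field in ('other_values');

Note that 'other_values' here stands for a comma-separated list of values, for example 1,2,3.

When updating several rows with a different value for each row, many people would probably write: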

foreach ($values as $id => $myvalue) {
    $sql = "UPDATE mytable SET myfield = $myvalue WHERE id = $id";
    mysql_query($sql);
}

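That is, loop and update the rows one at a time. Issuing one UPDATE per row performs poorly and can easily cause blocking.

So can a single SQL statement perform a batch update? MySQL offers no direct way to do this, but it can be done with a small trick.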

UPDATE mytable
    SET myfield = CASE id
        WHEN 1 THEN 'myvalue1'
        WHEN 2 THEN 'myvalue2'
        WHEN 3 THEN 'myvalue3'
    END
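WHERE other_field IN ('other_values')

If a row matched by the WHERE clause has an id that is not listed in the CASE, its myfield will be set to NULL. Adding ELSE myfield before END keeps the existing value for such rows.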
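
The NULL pitfall is easy to reproduce. The sketch below is an illustration that is not part of the original post: it builds the CASE statement with placeholders in Python 3 and runs it against an in-memory SQLite database (SQLite is used only so the example is self-contained; the CASE expression itself is standard SQL). The function names and the sample rows are assumptions.

```python
import sqlite3


def build_case_update(id_values, where_ids, keep_unmatched=False):
    """Return (sql, params) for a one-statement batch update of mytable.myfield.

    id_values maps id -> new value; where_ids lists the ids in the WHERE clause.
    With keep_unmatched=True an ELSE branch preserves rows missing from the CASE.
    """
    if not id_values or not where_ids:
        raise ValueError("nothing to update")
    whens = " ".join("WHEN ? THEN ?" for _ in id_values)
    else_clause = " ELSE myfield" if keep_unmatched else ""
    marks = ",".join("?" for _ in where_ids)
    sql = ("UPDATE mytable SET myfield = CASE id " + whens + else_clause
           + " END WHERE id IN (" + marks + ")")
    params = []
    for row_id, value in id_values.items():
        params.extend([row_id, value])
    params.extend(where_ids)
    return sql, params


def run_demo(keep_unmatched):
    """Update ids 1 and 2 while the WHERE clause also matches id 3."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, myfield TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c")])
    sql, params = build_case_update({1: "x", 2: "y"}, [1, 2, 3], keep_unmatched)
    conn.execute(sql, params)
    rows = conn.execute("SELECT id, myfield FROM mytable ORDER BY id").fetchall()
    conn.close()
    return rows
```

Without the ELSE branch, row 3 ends up with NULL in myfield; with keep_unmatched=True it keeps its original value.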

To update several fields, only a small change is needed:

UPDATE mytable
    SET myfield1 = CASE id
        WHEN 1 THEN 'myvalue11'
        WHEN 2 THEN 'myvalue12'
        WHEN 3 THEN 'myvalue13'
    END,
    myfield2 = CASE id
        WHEN 1 THEN 'myvalue21'
        WHEN 2 THEN 'myvalue22'
        WHEN 3 THEN 'myvalue23'
    END
WHERE id IN (1,2,3)

Taking PHP as an example, here is how to build these two MySQL statements:

Updating one field on several rows to different values, mysql style

$ids_values = array(
    1 => 11,
    2 => 22,
    3 => 33,
    4 => 44,
    5 => 55,
    6 => 66,
    7 => 77,
    8 => 88,
);
 
$ids = implode(',', array_keys($ids_values ));
$sql = "UPDATE mytable SET myfield = CASE id ";
foreach ($ids_values as $id=> $myvalue) {
    $sql .= sprintf("WHEN %d THEN %d ", $id, $myvalue);
}
$sql .= "END WHERE id IN ($ids)";
echo $sql.";<br/>";

Output:

UPDATE mytable SET myfield = CASE id WHEN 1 THEN 11 WHEN 2 THEN 22 WHEN 3 THEN 33 WHEN 4 THEN 44 WHEN 5 THEN 55 WHEN 6 THEN 66 WHEN 7 THEN 77 WHEN 8 THEN 88 END WHERE id IN (1,2,3,4,5,6,7,8);

Updating several fields to different values, PDO style

$data = array(array('id' => 1, 'myfield1val' => 11, 'myfield2val' => 111), array('id' => 2, 'myfield1val' => 22, 'myfield2val' => 222));
// Placeholders are named after the array keys (0, 1, ...), matching the loops below
$where_in_ids = implode(',', array_map(function($k) {return ":id_" . $k;}, array_keys($data)));
$update_sql = 'UPDATE mytable SET';
$params = array();

$update_sql .= ' myfield1 = CASE id';
foreach($data as $key => $item) {
    $update_sql .= " WHEN :id_" . $key . " THEN :myfield1val_" . $key . " ";
    $params[":id_" . $key] = $item['id'];
    $params[":myfield1val_" . $key] = $item['myfield1val'];
}
$update_sql .= " END";

$update_sql .= ',myfield2 = CASE id';
foreach($data as $key => $item) {
    $update_sql .= " WHEN :id_" . $key . " THEN :myfield2val_" . $key . " ";
    $params[":id_" . $key] = $item['id'];
    $params[":myfield2val_" . $key] = $item['myfield2val'];
}
$update_sql .= " END";

$update_sql .= " WHERE id IN (" . $where_in_ids . ")";
// Note: reusing a named placeholder such as :id_0 requires emulated prepares (PDO::ATTR_EMULATE_PREPARES)
echo $update_sql.";<br/>";
var_dump($params);

Output:

UPDATE mytable SET myfield1 = CASE id WHEN :id_0 THEN :myfield1val_0 WHEN :id_1 THEN :myfield1val_1 END,myfield2 = CASE id WHEN :id_0 THEN :myfield2val_0 WHEN :id_1 THEN :myfield2val_1 END WHERE id IN (:id_0,:id_1);

array (size=6)
 ':id_0' => int 1
 ':myfield1val_0' => int 11
 ':id_1' => int 2
 ':myfield1val_1' => int 22
 ':myfield2val_0' => int 111
 ':myfield2val_1' => int 222

Three other ways to batch update

1. Batch update with replace into

replace into mytable(id, myfield) values (1,'value1'),(2,'value2'),(3,'value3');

2. insert into ... on duplicate key update: batch insert that updates rows which already exist

insert into mytable(id, myfield1, myfield2) values (1,'value11','value21'),(2,'value12','value22'),(3,'value13','value23') on duplicate key update myfield1=values(myfield1),myfield2=values(myfield2);

The batch update works without resorting to the following, more verbose statement:

insert into mytable(id, myfield1, myfield2) values (1,'value11','value21'),(2,'value12','value22'),(3,'value13','value23') on duplicate key update myfield1=case id when values(id) then values(myfield1) end,myfield2=case id when values(id) then values(myfield2) end;

Note: the auto-increment id is incremented even when no row is actually inserted.
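
For a variable number of rows, the short form of the statement can be generated. The following Python 3 sketch is an illustration rather than code from the original post: it only builds the SQL text and the parameter list, using the %s placeholder style that pymysql expects. The function name is hypothetical, and the table and column names are interpolated directly, so they must come from trusted code and never from user input.

```python
def build_upsert(table, key, fields, rows):
    """Return (sql, params) for INSERT ... ON DUPLICATE KEY UPDATE.

    rows is a list of dicts that each contain the key column and every field.
    """
    if not rows:
        raise ValueError("no rows given")
    columns = [key] + list(fields)
    row_marks = "(" + ",".join(["%s"] * len(columns)) + ")"
    values_sql = ",".join([row_marks] * len(rows))
    updates = ",".join("{0}=VALUES({0})".format(f) for f in fields)
    sql = "INSERT INTO {} ({}) VALUES {} ON DUPLICATE KEY UPDATE {}".format(
        table, ", ".join(columns), values_sql, updates)
    params = [row[c] for row in rows for c in columns]
    return sql, params


if __name__ == "__main__":
    sql, params = build_upsert(
        "mytable", "id", ["myfield1", "myfield2"],
        [{"id": 1, "myfield1": "value11", "myfield2": "value21"},
         {"id": 2, "myfield1": "value12", "myfield2": "value22"}])
    print(sql)
    print(params)
```

The resulting pair can be passed to a cursor as cur.execute(sql, params), in the same way as the pymysql calls in the chuansong.me script earlier on this page.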

3. Temporary table

DROP TABLE IF EXISTS `tmptable`;
create temporary table tmptable(id int(4) primary key,myfield varchar(50));
insert into tmptable values (1,'value1'),(2,'value2'),(3,'value3');
update mytable, tmptable set mytable.myfield = tmptable.myfield where mytable.id = tmptable.id;
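
The same staging idea can be tried without a MySQL server. The Python 3 sketch below is an assumption-laden illustration, not the author's code: it uses the standard sqlite3 module, and because SQLite does not accept MySQL's multi-table UPDATE syntax, the final step is rewritten as a correlated subquery. The function name and the sample data are made up for the example.

```python
import sqlite3


def batch_update_via_temp_table(conn, id_values):
    """Stage the new values in a temporary table, then apply them in one UPDATE."""
    conn.execute("DROP TABLE IF EXISTS tmptable")
    conn.execute(
        "CREATE TEMPORARY TABLE tmptable (id INTEGER PRIMARY KEY, myfield TEXT)")
    conn.executemany("INSERT INTO tmptable VALUES (?, ?)", list(id_values.items()))
    # SQLite stand-in for: update mytable, tmptable set ... where mytable.id = tmptable.id
    conn.execute(
        "UPDATE mytable SET myfield = "
        "(SELECT t.myfield FROM tmptable AS t WHERE t.id = mytable.id) "
        "WHERE id IN (SELECT id FROM tmptable)")
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, myfield TEXT)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)",
                     [(1, "a"), (2, "b"), (3, "c"), (4, "d")])
    batch_update_via_temp_table(conn, {1: "value1", 2: "value2", 3: "value3"})
    rows = conn.execute("SELECT id, myfield FROM mytable ORDER BY id").fetchall()
    conn.close()
    return rows
```

Rows that have no counterpart in the temporary table (id 4 here) are left untouched, because the WHERE clause restricts the update to staged ids.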

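[replace into] and [insert into] updates both rely on a primary key or unique index, and both carry the structural risk of inserting new rows unintentionally.
[replace into] essentially deletes the duplicate row and then inserts it again; if not all fields are supplied, the missing ones are set to their default values.
[insert into] only updates the duplicate rows; the changed fields can only take the values given by the update expressions.
The [temporary table] approach requires the user to have the privilege to create temporary tables.
For small batches [replace into] and [insert into] perform best; for large batches the [temporary table] is best; [CASE] is the most general approach and carries no structural risk.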