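This post is not about fixing garbled text; it is about recovering the content behind it. After the game launched we needed to run a sentiment analysis and wanted to collect the in-game chat from the previous two days. Unfortunately there was no dedicated chat log, so the only option was to extract the protobuf contents from the network library's log.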

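For every protobuf message it receives, the network library calls a method that logs the values of all fields, but Chinese characters come out looking like "\346\234\211\344\272\272\345\220\227". The goal is to turn these escapes back into readable Chinese characters.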

1 Trying encoding conversions

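The in-game chat text is UTF-8. However, some of the numbers above (346, for example, if read as decimal) exceed 255, the largest value a one-byte char can hold, so the output looked like double-byte Unicode. Converting it to Unicode still displayed garbage, and viewing it as UTF-8 or ANSI gave garbage as well, so the only way forward was to read the code and see how the log line is actually printed.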

2 Analyzing the DebugString function

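DebugString eventually calls the function below. It turns out the escapes are octal: every non-printable byte is written as a backslash followed by three octal digits, and on closer inspection the log output does match that. With this known, the rest is straightforward.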

// Escapes 'src' using C-style escape sequences, and appends the escaped string
// to 'dest'. This version is faster than calling CEscapeInternal as it computes
// the required space using a lookup table, and also does not do any special
// handling for Hex or UTF-8 characters.
// ----------------------------------------------------------------------
void CEscapeAndAppend(StringPiece src, string* dest) {
  size_t escaped_len = CEscapedLength(src);
  if (escaped_len == src.size()) {
    dest->append(src.data(), src.size());
    return;
  }

  size_t cur_dest_len = dest->size();
  dest->resize(cur_dest_len + escaped_len);
  char* append_ptr = &(*dest)[cur_dest_len];

  for (int i = 0; i < src.size(); ++i) {
    unsigned char c = static_cast<unsigned char>(src[i]);
    switch (c) {
      case '\n': *append_ptr++ = '\\'; *append_ptr++ = 'n'; break;
      case '\r': *append_ptr++ = '\\'; *append_ptr++ = 'r'; break;
      case '\t': *append_ptr++ = '\\'; *append_ptr++ = 't'; break;
      case '\"': *append_ptr++ = '\\'; *append_ptr++ = '\"'; break;
      case '\'': *append_ptr++ = '\\'; *append_ptr++ = '\''; break;
      case '\\': *append_ptr++ = '\\'; *append_ptr++ = '\\'; break;
      default:
        if (!isprint(c)) {
          *append_ptr++ = '\\';
          *append_ptr++ = '0' + c / 64;
          *append_ptr++ = '0' + (c % 64) / 8;
          *append_ptr++ = '0' + c % 8;
        } else {
          *append_ptr++ = c;
        }
        break;
    }
  }
}
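
To make the arithmetic in the non-printable branch concrete, here is a minimal sketch of that branch and its inverse. `EscapeByte`, `EscapeBytes` and `UnescapeByte` are hypothetical names introduced only for this illustration; they are not part of protobuf.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: escape one byte the way the non-printable branch
// above does, as a backslash followed by exactly three octal digits.
std::string EscapeByte(unsigned char c) {
    std::string out = "\\";
    out += static_cast<char>('0' + c / 64);        // highest octal digit
    out += static_cast<char>('0' + (c % 64) / 8);  // middle octal digit
    out += static_cast<char>('0' + c % 8);         // lowest octal digit
    return out;
}

// Hypothetical helper: escape every byte of a string, printable or not.
// (The real function leaves printable characters untouched.)
std::string EscapeBytes(const std::string& bytes) {
    std::string out;
    for (unsigned char c : bytes) out += EscapeByte(c);
    return out;
}

// Hypothetical helper: turn a four-character "\ooo" escape back into a byte
// by skipping the backslash and parsing the digits in base 8.
unsigned char UnescapeByte(const std::string& esc) {
    return static_cast<unsigned char>(std::strtol(esc.c_str() + 1, nullptr, 8));
}
```

For instance, `EscapeBytes("\xE6\x9C\x89")` yields `\346\234\211`: octal 346 is 0xE6, 234 is 0x9C and 211 is 0x89, and the byte sequence E6 9C 89 is the UTF-8 encoding of U+6709 (有), the first character of the logged sample. The numbers only looked larger than a byte because they were being read as decimal.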

3 Extracting the data and printing it

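First extract the escapes with a C++11 regular expression, then convert each octal value with strtol and append the resulting byte to the output. The code is as follows: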

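#include "stdafx.h"                 // MSVC precompiled header
#include "../../../Com/CFunc.hpp"   // the author's own helper headers (not shown here)
#include "../../../Com/CStr.hpp"
#include "../../../Com/CFile.hpp"
#include <cstdlib>
#include <iostream>
#include <regex>
#include <string>
using namespace std;

int main()
{
    // linevec_t and XReadFileLine presumably come from the helper headers above:
    // read chat.csv into a list of lines.
    linevec_t lines;
    XReadFileLine("chat.csv", lines);

    // Matches a backslash followed by exactly three octal digits, e.g. \346.
    std::regex reg("\\\\[0-7]{3}", std::regex_constants::ECMAScript);

    sregex_token_iterator end;
    for (const auto& s : lines)
    {
        // Only the octal escapes are collected; other characters on the line are ignored.
        string chat;
        sregex_token_iterator it(s.begin(), s.end(), reg);
        while (it != end)
        {
            string ss = *it;
            ++it;

            // Skip the leading backslash and parse the three digits as base 8.
            long ret = strtol(ss.c_str() + 1, NULL, 8);
            chat += static_cast<char>(ret);
        }
        cout << chat << endl;
    }
    return 0;
}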
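
The program above keeps only the octal escapes, so any plain ASCII in a chat message (letters, digits, punctuation) is dropped. If the input is just the escaped field value rather than a whole log line, a fuller decoder can be sketched as a single pass over the string. This is a different technique from the regex approach, and `UnescapeC` and `IsOctal` are hypothetical names, not protobuf functions; the sketch assumes the input was produced by the escaping code quoted in section 2.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true for the characters '0'..'7'.
static bool IsOctal(char ch) { return ch >= '0' && ch <= '7'; }

// Hypothetical helper: reverses the escapes written by the code in section 2.
// Printable characters are copied through unchanged.
std::string UnescapeC(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\' || i + 1 >= in.size()) {
            out += in[i];  // ordinary character, or a trailing lone backslash
            continue;
        }
        // "\ooo": three octal digits encode one byte.
        if (i + 3 < in.size() && IsOctal(in[i + 1]) && IsOctal(in[i + 2]) &&
            IsOctal(in[i + 3])) {
            const int value = (in[i + 1] - '0') * 64 + (in[i + 2] - '0') * 8 +
                              (in[i + 3] - '0');
            out += static_cast<char>(static_cast<unsigned char>(value));
            i += 3;  // skip the digits; the loop increment skips the backslash
            continue;
        }
        // Two-character escapes: \n \r \t \" \' and the escaped backslash.
        switch (in[i + 1]) {
            case 'n': out += '\n'; break;
            case 'r': out += '\r'; break;
            case 't': out += '\t'; break;
            default:  out += in[i + 1]; break;  // quote or backslash
        }
        ++i;  // skip the escaped character
    }
    return out;
}
```

Applied to the sample from the log, `UnescapeC("\\346\\234\\211\\344\\272\\272\\345\\220\\227")` returns the nine UTF-8 bytes E6 9C 89 E4 BA BA E5 90 97, which display as 有人吗 in a UTF-8 terminal.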