
fix UnicodeEncodeError when retrieving words from utf-8 encoded file

Summary:
This commit fixes the issue https://github.com/facebookresearch/fastText/issues/746
pybind11's `py::str` constructor [behaves differently](https://github.com/pybind/pybind11/blob/ccbe68b084806dece5863437a7dc93de20bd9b15/include/pybind11/pytypes.h#L930) under Python 2 and Python 3. When casting a C++ string to `py::str`, we decode it as UTF-8; under Python 2 we must also encode the result back to UTF-8 so that `py::str` is constructed correctly.

Reviewed By: EdouardGrave

Differential Revision: D14783627

fbshipit-source-id: 8a7d4b16f42d6d892203cf3d72f144427008dd7f
Onur Çelebi, 6 years ago
parent
commit
71c0ee5a8b
1 changed file with 8 additions and 0 deletions
python/fastText/pybind/fasttext_pybind.cc (+8 -0)

@@ -26,6 +26,14 @@ py::str castToPythonString(const std::string& s, const char* onUnicodeError) {
   if (!handle) {
     throw py::error_already_set();
   }
+
+  // py::str's constructor from a PyObject assumes the string has been encoded
+  // for python 2 and not encoded for python 3 :
+  // https://github.com/pybind/pybind11/blob/ccbe68b084806dece5863437a7dc93de20bd9b15/include/pybind11/pytypes.h#L930
+#if PY_MAJOR_VERSION < 3
+  handle = PyUnicode_AsEncodedString(handle, "utf-8", onUnicodeError);
+#endif
+
   return py::str(handle);
 }
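
The decode-then-encode round trip described in the summary can be illustrated without the CPython API. The sketch below is a standalone approximation and not fastText's code: `decodeUtf8` and `encodeUtf8` are hypothetical helpers that stand in for the decode step performed before this hunk and for `PyUnicode_AsEncodedString`, and the `policy` parameter only loosely mirrors `onUnicodeError` (`"strict"` throws, any other value substitutes U+FFFD per invalid byte, which may differ from CPython's error handlers).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the decode step: UTF-8 bytes -> code points.
// "strict" throws on invalid input; any other policy emits U+FFFD for the
// offending byte and resumes at the next byte.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes,
                                      const std::string& policy) {
  static const std::uint32_t minCp[] = {0, 0, 0x80, 0x800, 0x10000};
  std::vector<std::uint32_t> out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    const unsigned char b = static_cast<unsigned char>(bytes[i]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (b < 0x80) { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }

    bool ok = len != 0 && i + len <= bytes.size();
    for (std::size_t j = 1; ok && j < len; ++j) {
      const unsigned char c = static_cast<unsigned char>(bytes[i + j]);
      if ((c & 0xC0) != 0x80) ok = false;  // not a continuation byte
      else cp = (cp << 6) | (c & 0x3F);
    }
    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    if (ok && (cp < minCp[len] || cp > 0x10FFFF ||
               (cp >= 0xD800 && cp <= 0xDFFF))) {
      ok = false;
    }
    if (!ok) {
      if (policy == "strict") throw std::runtime_error("invalid UTF-8");
      out.push_back(0xFFFD);
      ++i;
      continue;
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Hypothetical stand-in for the re-encode step: code points -> UTF-8 bytes.
std::string encodeUtf8(const std::vector<std::uint32_t>& codePoints) {
  std::string out;
  for (std::uint32_t cp : codePoints) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

For valid input, `encodeUtf8(decodeUtf8(s, "strict"))` returns the original bytes, which is the property the Python 2 branch relies on: the byte string handed to `py::str` is well-formed UTF-8 regardless of how invalid input was handled during decoding.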