Differences between Python 2 and Python 3 in character encoding handling and in reading text files (r271-635)

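Python's handling of character encodings is confusing, because backward compatibility was thrown away in the move from version 2 to version 3.
I put together these notes to have on hand when programming.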

Checking the default encodings

Python 2 (Ubuntu)

>>> import sys
>>> sys.version
'2.7.17 (default, Sep 30 2020, 13:38:04) \n[GCC 7.5.0]'
 
>>> print(sys.stdout.encoding)
UTF-8
>>> print(sys.stdin.encoding)
UTF-8
>>> print(sys.getdefaultencoding())
ascii
>>> sys.getdefaultencoding()
'ascii'
>>> print(sys.getfilesystemencoding())
UTF-8
>>> import locale
>>> print(locale.getpreferredencoding())
UTF-8

Python 3 (Ubuntu)

>>> import sys
>>> sys.version
'3.6.9 (default, Oct  8 2020, 12:12:24) \n[GCC 8.4.0]'
 
>>> print(sys.stdout.encoding)
UTF-8
>>> print(sys.stdin.encoding)
UTF-8
>>> print(sys.getdefaultencoding())   # default encoding for str.encode() and bytes.decode() when called without arguments
utf-8
>>> print(sys.getfilesystemencoding())   # encoding used for file names
utf-8
>>> import locale
>>> print(locale.getpreferredencoding())   # default encoding when opening a text file
UTF-8

Python 3 (Windows 10): differences from Ubuntu

>>> import sys
>>> sys.version
'3.9.0 (tags/v3.9.0:9cf6752, Oct  5 2020, 15:34:40) [MSC v.1927 64 bit (AMD64)]'
 
>>> import locale
>>> print(locale.getpreferredencoding())   # default encoding when opening a text file
cp932
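
Since the locale default is UTF-8 on Ubuntu but cp932 on Windows 10, a file opened without encoding= is decoded differently on each machine. The following is a minimal sketch of checking which encoding a file object actually uses, assuming Python 3; the scratch file name sample.txt is only for illustration.

```python
import io
import locale
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"abc\n")

# Without encoding=, the text layer falls back to a platform-dependent default.
with io.open(path, "r") as f:
    default_encoding = f.encoding

# With encoding=, the result no longer depends on the platform.
with io.open(path, "r", encoding="utf-8") as f:
    explicit_encoding = f.encoding

print(locale.getpreferredencoding(), default_encoding, explicit_encoding)
```

Passing encoding= every time is the simplest way to make a script behave the same on both systems.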

Encoding and decoding strings

Python 2 (Ubuntu)

>>> s="文字列"
>>> print(s)
文字列
>>> print(type(s))
<type 'str'>   # the literal is a str (a byte string) that the Python script can handle
 
>>> print(s.decode('utf-8'))   # convert str to unicode
文字列
>>> print(type(s.decode('utf-8')))
<type 'unicode'>
 
>>> print(s.encode('utf-8'))   # encode() on a str fails: Python 2 first decodes the str implicitly as ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
>>> print(type(s.encode('utf-8')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
 
>>> print(type(s.decode('utf-8').encode('utf-8')))   # convert str to unicode (decode), then unicode back to str (encode)
<type 'str'>
>>> print(s.decode('utf-8').encode('utf-8'))
文字列
 
>>> print(s[1])   # in Python 2, str cannot handle a (multibyte) string character by character; indexing returns a single byte
�
>>> print(s.decode('utf-8')[1])   # unicode can handle a (multibyte) string character by character
字
 
>>> s='abcdefg'
>>> print(s[1])   # str can be handled character by character as long as the string is single-byte
b

Python 3 (same results on Ubuntu and Windows 10)

>>> s="文字列"
>>> print(s)
文字列
>>> print(type(s))
<class 'str'>
 
>>> print(s.decode('utf-8'))   # str has no decode(): it is already Unicode text
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
 
>>> print(s.encode('utf-8'))   # convert str to bytes
b'\xe6\x96\x87\xe5\xad\x97\xe5\x88\x97'
>>> print(type(s.encode('utf-8')))
<class 'bytes'>
 
>>> print(s.encode('utf-8').decode('utf-8'))   # convert str to bytes (encode), then bytes back to str (decode)
文字列
>>> print(type(s.encode('utf-8').decode('utf-8')))
<class 'str'>
 
>>> print(s[1])   # in Python 3, str handles a (multibyte) string character by character
字
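
The difference between indexing characters and indexing bytes can be made explicit by comparing the str with its encoded bytes. This is a small sketch for Python 3, reusing the same sample string as above.

```python
s = "文字列"
b = s.encode("utf-8")

# A str counts characters; bytes counts the encoded bytes.
char_count = len(s)
byte_count = len(b)
print(char_count, byte_count)

# Indexing a str yields a one-character str.
second_char = s[1]

# Indexing bytes yields an int in Python 3, not a one-byte string.
second_byte = b[1]

# Each of these characters takes three bytes in UTF-8,
# so bytes 3 to 5 are the second character.
second_char_from_bytes = b[3:6].decode("utf-8")
print(second_char, second_byte, second_char_from_bytes)
```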

Encoding and decoding unicode strings

Python 2 (Ubuntu)

>>> su=u"文字列"
>>> print(su)
文字列
>>> print(type(su))
<type 'unicode'>
 
>>> print(su.decode('utf-8'))   # decode() on a unicode fails: Python 2 first encodes the unicode implicitly as ASCII
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
 
>>> print(su.encode('utf-8'))   # convert unicode to str
文字列
>>> print(type(su.encode('utf-8')))
<type 'str'>
 
>>> print(su.encode('utf-8').decode('utf-8'))   # convert unicode to str (encode), then str back to unicode (decode)
文字列
>>> print(type(su.encode('utf-8').decode('utf-8')))
<type 'unicode'>

Python 3 (Ubuntu)

>>> su=u"文字列"
>>> print(su)
文字列
>>> print(type(su))
<class 'str'>
 
>>> print(su.decode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
 
>>> print(su.encode('utf-8'))   # convert str to bytes
b'\xe6\x96\x87\xe5\xad\x97\xe5\x88\x97'
>>> print(type(su.encode('utf-8')))
<class 'bytes'>
 
>>> print(su.encode('utf-8').decode('utf-8'))   # convert str to bytes (encode), then bytes back to str (decode)
文字列
>>> print(type(su.encode('utf-8').decode('utf-8')))
<class 'str'>
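
When one script has to run on both versions, the type differences above can be hidden behind a pair of small helpers. This is only a sketch: ensure_text and ensure_bytes are hypothetical names chosen for illustration, and UTF-8 is assumed as the default encoding.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name exists only on Python 2)
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding byte strings with the given encoding."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError("expected text or bytes, got %r" % type(value))


def ensure_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding text with the given encoding."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    if isinstance(value, binary_type):
        return value
    raise TypeError("expected text or bytes, got %r" % type(value))


print(ensure_text(b"\xe6\x96\x87\xe5\xad\x97\xe5\x88\x97"))
```

The idea is to decode once at the input boundary, work with text inside the program, and encode once at the output boundary.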

Reading a text file

Python 2 (Ubuntu)

>>> import io
>>> f=io.open("test.txt", "r", encoding="utf-8")   # open a UTF-8 text file
>>> print(type(f))
<type '_io.TextIOWrapper'>
 
>>> s=f.readline()
 
>>> print(type(s))   # the string read from the file is of type unicode
<type 'unicode'>
>>> print(s)
テスト文字列1行目
 
>>> print(type(s.encode('utf-8')))   # convert unicode to str
<type 'str'>
>>> print(s.encode('utf-8'))
テスト文字列1行目
 
>>> f.closed   # check whether the file has been closed
False
>>> f.close()
>>> f.closed
True

Python 3 (Ubuntu)

>>> import io
>>> f=io.open("test.txt", "r", encoding="utf-8")   # open a UTF-8 text file
>>> print(type(f))
<class '_io.TextIOWrapper'>
 
>>> s=f.readline()
 
>>> print(type(s))   # the string read from the file is of type str
<class 'str'>
>>> print(s)
テスト文字列1行目
 
>>> f.closed   # check whether the file has been closed
False
>>> f.close()
>>> f.closed
True
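
Outside the interactive prompt, the open, read, close sequence above is usually written with a with statement, which closes the file automatically. A minimal sketch that behaves the same on Python 2.7 and 3; the scratch file and its second line are made up for the illustration.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical scratch file standing in for test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"テスト文字列1行目\nテスト文字列2行目\n")

lines = []
with io.open(path, "r", encoding="utf-8") as f:
    # Iterating over the file yields one decoded line at a time.
    for line in f:
        lines.append(line.rstrip(u"\n"))

# The with block has already closed the file at this point.
closed_after = f.closed
print(len(lines), closed_after)
```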

For a text file of unknown encoding, detect the encoding before reading it

Python 2 and 3 (Ubuntu)

>>> import io
>>> import chardet
>>> f=open('test.txt', 'rb')
>>> str_encode=chardet.detect(f.read())['encoding']
>>> f.close()
 
>>> print(str_encode)   # the encoding detected by chardet is Shift JIS
SHIFT_JIS
 
# For a short text file, detection can fail and report Windows-1252 instead of SHIFT_JIS
 
>>> f=io.open('test.txt', 'r', encoding=str_encode)   # open the text file with the encoding detected by chardet
 
>>> s=f.readline()
# remaining processing omitted
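
Because chardet can misjudge short files, one possible complement is to try a short list of likely encodings with strict decoding and take the first that succeeds. This is a different technique from chardet's statistical detection, not a replacement for it. The function name decode_with_fallback and the candidate list are assumptions for this sketch, and a file that happens to be valid in an earlier candidate will still be decoded with that one.

```python
# -*- coding: utf-8 -*-


def decode_with_fallback(data, candidates=("utf-8", "shift_jis", "euc_jp")):
    """Decode bytes with the first candidate encoding that decodes cleanly.

    Returns a (text, encoding) tuple. The candidate order matters: UTF-8 is
    tried first because invalid UTF-8 is rejected reliably.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


sjis_bytes = u"テスト文字列1行目\n".encode("shift_jis")
text, used = decode_with_fallback(sjis_bytes)
print(used)
```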

Writing a text file

Python 2 (Ubuntu)

>>> import io
>>> s="ファイルに書き込む文字列"
>>> print(type(s))
<type 'str'>
 
>>> f=io.open('test_output.txt', 'w', encoding='shift-jis')   # write with the Shift JIS encoding
>>> print(type(f))
<type '_io.TextIOWrapper'>
 
>>> f.write(s)   # passing a str to write() raises an error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: write() argument 1 must be unicode, not str
 
>>> f.write(s.decode('utf-8'))   # convert the str to unicode before passing it to write()
12L   # the number of characters written is counted correctly even for multibyte characters
>>> f.close()

Python 3 (Ubuntu)

>>> import io
>>> s="ファイルに書き込む文字列"
 
>>> f=io.open('test_output.txt', 'w', encoding='shift-jis')   # write with the Shift JIS encoding
 
>>> f.write(s)   # in Python 3, a str can be passed to write() directly
12
>>> f.close()
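
The value returned by write() is a character count, not a byte count. A short sketch, assuming Python 3, that writes the same string as above and then reads the file back as raw bytes to compare the two numbers; the output goes to a temporary directory rather than the current one.

```python
import io
import os
import tempfile

s = "ファイルに書き込む文字列"

# Hypothetical scratch location for the output file.
path = os.path.join(tempfile.mkdtemp(), "test_output.txt")

with io.open(path, "w", encoding="shift-jis") as f:
    written = f.write(s)   # number of characters written

with io.open(path, "rb") as f:
    raw = f.read()         # the Shift JIS bytes actually stored on disk

# Every character in this string takes two bytes in Shift JIS.
print(written, len(raw))
round_trip_ok = raw.decode("shift-jis") == s
print(round_trip_ok)
```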

Reading a text file as a binary file

Python 2 (Ubuntu)

>>> import io
>>> f=io.open("test.txt", "rb")
>>> print(type(f))
<type '_io.BufferedReader'>
 
>>> s=f.readline()
 
>>> print(type(s))   # in Python 2, str and binary data are interchangeable (str is a byte string)
<type 'str'>
>>> print(s)
テスト文字列1行目
  
>>> f.close()

Python 3 (Ubuntu)

>>> import io
>>> f=io.open("test.txt", "rb")
>>> print(type(f))
<class '_io.BufferedReader'>
 
>>> s=f.readline()
 
>>> print(type(s))   # in Python 3, binary data is of type bytes
<class 'bytes'>
>>> print(s)
b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe6\x96\x87\xe5\xad\x97\xe5\x88\x971\xe8\xa1\x8c\xe7\x9b\xae\n'
 
>>> print(type(s.decode('utf-8')))   # convert bytes to str (decode)
<class 'str'>
>>> print(s.decode('utf-8'))
テスト文字列1行目
 
>>> f.close()
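
When a file is read in binary mode, the decoding step is the caller's responsibility, and choosing the wrong codec raises UnicodeDecodeError. A minimal sketch, assuming Python 3 and a scratch file with the same first line as test.txt above.

```python
import io
import os
import tempfile

# Hypothetical scratch file standing in for test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "wb") as f:
    f.write("テスト文字列1行目\n".encode("utf-8"))

with io.open(path, "rb") as f:
    raw = f.readline()      # bytes, including the trailing newline byte

text = raw.decode("utf-8")  # explicit decode turns the bytes into str

# Decoding with a codec that cannot represent these bytes fails loudly.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(type(raw).__name__, type(text).__name__, ascii_ok)
```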


17 January 2021
