Which encoding to use for reading Italian text in Python?
I'm using Python Tools for Visual Studio and reading files written in Italian. I have tried ISO-8859-1, ISO-8859-2, UTF-8, and UTF-8-SIG. Notepad++ opens the file as UTF-8 without BOM.
    content = fp.read()
    words = content.decode("utf-8-sig").lower().split()
    for w in words:
        p = ''
        cur.execute('select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)
The string that results in the crash is c'è (it is being read as "c\'\xe3\xa8").
Using chardet did not help.
    Traceback (most recent call last):
      File "c:\users\tathagata\documents\visual studio 2012\projects\pythonapplication4\pythonapplication4\pythonapplication4.py", line 344, in <module>
        createsynsetdict()
      File "c:\users\tathagata\documents\visual studio 2012\projects\pythonapplication4\pythonapplication4\pythonapplication4.py", line 294, in createsynsetdict
        cur.execute('select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)
      File "c:\python27\lib\site-packages\pymysql\cursors.py", line 117, in execute
        self.errorhandler(self, exc, value)
      File "c:\python27\lib\site-packages\pymysql\connections.py", line 187, in defaulterrorhandler
        raise error(errorclass, errorvalue)
    error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u's\x00\x00\x00\x03select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="c\'\xe3\xa8"', 116, 118, 'ordinal not in range(128)'))
Presuming that your database library's style of bind variables is format:
...
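    content = fp.read()
    words = content.decode("utf-8-sig").lower().split()
    for w in words:
        p = ''
        # The value is passed separately (as a one-element tuple), not
        # interpolated into the SQL string with the % operator.
        cur.execute('select word from '
                    + 'multiwordnet.italian_lemma l, '
                    + 'multiwordnet.italian_synset s '
                    + 'where l.id = s.id and l.lemma=%s', (w,))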
Note that we aren't using the % operator between the SQL string and the variable being passed in, and we aren't putting inner quotes around the %s; rather, %s is a placeholder that identifies where in the SQL the word should be substituted, and we're passing the value to be substituted for the placeholder as a separate argument. Following this practice not only prevents you from needing to deal with encoding issues (if the argument is passed as a Python unicode string, the database bindings are responsible for taking it from there), but also prevents SQL injection security vulnerabilities.
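As a rough illustration of the same idea, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for the MySQL driver, so it can be run without a database server. The single-column italian_lemma table and the find_lemma helper are assumptions made up for this example, and sqlite3 uses ? rather than %s as its placeholder. It is written for Python 3.

```python
import sqlite3


def find_lemma(conn, word):
    """Return all lemmas equal to `word`, passing the value as a bound parameter."""
    # The value travels separately from the SQL text, so no quoting or
    # manual encoding of `word` is needed here.
    cur = conn.execute(
        "SELECT lemma FROM italian_lemma WHERE lemma = ?", (word,)
    )
    return [row[0] for row in cur.fetchall()]


# Hypothetical in-memory table, standing in for multiwordnet.italian_lemma.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE italian_lemma (lemma TEXT)")
conn.executemany(
    "INSERT INTO italian_lemma (lemma) VALUES (?)",
    [("c'è",), ("città",)],
)

# A word containing both an apostrophe and a non-ASCII character.
matches = find_lemma(conn, "c'è")

# An input shaped like an injection attempt is treated as plain data.
injection_attempt = find_lemma(conn, 'x" OR "1"="1')
```

Because the driver receives the word as data, the apostrophe in c'è does not terminate a string literal and the accented character does not have to be encoded by hand.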
Other database libraries for Python may use different placeholder styles; read the documentation or check the module-level paramstyle constant for yours. (For qmark, the placeholder should be ?; for numeric, it should be colon-prefixed numbers: :1 for the first parameter, :2 for the second, etc.)
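To check which style a given driver expects, you can inspect its paramstyle attribute. The sketch below does this for the standard-library sqlite3 module; the placeholder helper is a hypothetical convenience written for this example (it is not part of any library) and only covers the three positional styles mentioned here.

```python
import sqlite3


def placeholder(paramstyle, position):
    """Return the placeholder text for the 1-based `position` in a positional style."""
    if paramstyle == "qmark":
        return "?"
    if paramstyle == "format":
        return "%s"
    if paramstyle == "numeric":
        return ":%d" % position
    # "named" and "pyformat" need parameter names, so they are out of scope here.
    raise ValueError("unsupported paramstyle: %r" % paramstyle)


# DB-API modules expose their style as a module-level constant.
style = sqlite3.paramstyle
sql = "SELECT lemma FROM italian_lemma WHERE lemma = " + placeholder(style, 1)
```

Only the placeholder text is built by concatenation here; the values themselves are still passed to execute as a separate argument.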