visual studio 2010 - Which encoding to use for reading Italian text in Python?


I'm using Python Tools for Visual Studio and reading files written in Italian. I have tried ISO-8859-1, ISO-8859-2, UTF-8, and UTF-8-sig. Notepad++ opens the file as UTF-8 without BOM.

    content = fp.read()
    words = content.decode("utf-8-sig").lower().split()
    for w in words:
        p=''
        cur.execute('select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)

The string that results in the crash is c'è. (It is being read as "c\'\xe3\xa8".)
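One possible reading of that garbled value, offered as an assumption rather than a confirmed diagnosis: the bytes are valid UTF-8 but are being decoded as ISO-8859-1 somewhere along the way, and lower() then turns the resulting Ã into ã. A minimal sketch, written in Python 3 syntax (the question itself uses Python 2.7), reproduces the same escape sequence:

```python
# Assumption: the file holds UTF-8 bytes but is decoded as ISO-8859-1 somewhere.
raw = "c'\u00e8".encode("utf-8")            # the UTF-8 bytes for c'è: b"c'\xc3\xa8"

# Wrong codec: each byte becomes its own character, then lower() maps Ã to ã.
wrong = raw.decode("iso-8859-1").lower()

# Right codec: the two bytes collapse back into the single character è.
right = raw.decode("utf-8").lower()

print(ascii(wrong))
print(ascii(right))
```

If the first printed value matches what you see, the decode step that actually ran was not the utf-8-sig one shown above.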

Using chardet did not help.

    Traceback (most recent call last):
      File "c:\users\tathagata\documents\visual studio 2012\projects\pythonapplication4\pythonapplication4\pythonapplication4.py", line 344, in <module>
        createsynsetdict()
      File "c:\users\tathagata\documents\visual studio 2012\projects\pythonapplication4\pythonapplication4\pythonapplication4.py", line 294, in createsynsetdict
        cur.execute('select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="%s"' % w)
      File "c:\python27\lib\site-packages\pymysql\cursors.py", line 117, in execute
        self.errorhandler(self, exc, value)
      File "c:\python27\lib\site-packages\pymysql\connections.py", line 187, in defaulterrorhandler
        raise Error(errorclass, errorvalue)
    Error: (<type 'exceptions.UnicodeEncodeError'>, UnicodeEncodeError('ascii', u's\x00\x00\x00\x03select word from multiwordnet.italian_lemma l, multiwordnet.italian_synset s where l.id = s.id and l.lemma="c\'\xe3\xa8"', 116, 118, 'ordinal not in range(128)'))

Presuming that your database's style of bind variables is format...

    content = fp.read()
    words = content.decode("utf-8-sig").lower().split()
    for w in words:
        p=''
        # The word is passed as a separate one-element tuple, not interpolated with %.
        cur.execute('select word from ' +
                    'multiwordnet.italian_lemma l, ' +
                    'multiwordnet.italian_synset s ' +
                    'where l.id = s.id and l.lemma=%s', (w,))

Note that we aren't using the % operator between the SQL string and the variable being passed in, and we aren't putting inner quotes around the %s; rather, %s is a placeholder that identifies where in the SQL the word should be substituted, and we're passing the value to be substituted for the placeholder as a separate argument. Following this practice not only prevents you from needing to deal with encoding issues yourself (if the argument is passed as a Python unicode string, the database bindings are responsible for taking it from there), but also prevents SQL injection security vulnerabilities.
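The difference can be sketched with the standard-library sqlite3 module, used here only as a stand-in because it needs no server. Note the assumptions: sqlite3 uses ? rather than %s as its placeholder, the table is a simplified, hypothetical version of italian_lemma, and the code is written in Python 3 syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified stand-in for multiwordnet.italian_lemma.
cur.execute("CREATE TABLE italian_lemma (id INTEGER, lemma TEXT)")
cur.executemany("INSERT INTO italian_lemma VALUES (?, ?)",
                [(1, "c'\u00e8"), (2, "casa")])

def lookup(word):
    # The word travels as a separate argument; the driver handles quoting and encoding.
    cur.execute("SELECT id FROM italian_lemma WHERE lemma = ?", (word,))
    return [row[0] for row in cur.fetchall()]

# A word containing both a quote and a non-ASCII character needs no special care.
found = lookup("c'\u00e8")

# A hostile "word" stays plain data when bound, so it matches nothing.
hostile_word = "x' OR '1'='1"
hostile = lookup(hostile_word)

# The same input interpolated with % rewrites the WHERE clause and matches every row.
cur.execute("SELECT id FROM italian_lemma WHERE lemma = '%s'" % hostile_word)
leaked = cur.fetchall()

print(found, hostile, len(leaked))
```

The bound version treats the input purely as a value, while the interpolated version lets the input change the structure of the query.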

Other database libraries for Python may use different placeholder styles; read the documentation or check the module-level paramstyle constant for yours. (For qmark, the placeholder should be ?; for numeric, it should be colon-prefixed numbers: :1 for the first parameter, :2 for the second, etc.)
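As a small illustration, the constant can be read directly from any DB-API module. The sketch below uses sqlite3 from the standard library as an example module; placeholder_for is a hypothetical helper, not part of any library, and it covers only the three styles mentioned here.

```python
import sqlite3

def placeholder_for(paramstyle, position):
    """Return the placeholder for a 1-based parameter position (hypothetical helper)."""
    if paramstyle == "format":
        return "%s"
    if paramstyle == "qmark":
        return "?"
    if paramstyle == "numeric":
        return ":%d" % position
    raise ValueError("style not covered by this sketch: %r" % paramstyle)

# Every DB-API module exposes its style as a module-level string.
print(sqlite3.paramstyle)

# Build the query text with whichever placeholder the module expects.
sql = ("SELECT lemma FROM italian_lemma WHERE lemma = "
       + placeholder_for(sqlite3.paramstyle, 1))
print(sql)
```

Only the placeholder text varies between styles; the value itself is still passed to execute() as a separate argument.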

