Guaranteed unicode or ASCII backoff in Python 2.7


I'm a big fan of the UnicodeDammit module in BeautifulSoup4, which puts a string firmly in unicode, with HTML unescaping:

    from bs4 import UnicodeDammit
    unicode_page = UnicodeDammit(raw_page, [suspected_encodings_if_any]).unicode_markup

There are cases where even the mighty Dammit fails, though, and returns an empty string. I want to have some kind of backoff to ASCII for these cases.

Dammit already uses chardet, so there is no point in backing off to that. (Dammit also looks for the iconv_codec library. Does anyone have experience with it?) What's the best way of backing off to ASCII? My attempt loses things, but seems to work:

    from bs4 import UnicodeDammit

    def to_unicode_with_ascii_backoff(text):
        if isinstance(text, unicode):
            return text
        else:
            ud = UnicodeDammit(text).unicode_markup
            if ud:
                return ud
            else:
                # Fallback: keep only the ASCII bytes (lossy).
                return ''.join(i for i in text if ord(i) < 128)
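One thing worth noting about the filter in that last branch: on Python 2.7, a byte string filtered this way is still a byte string, so the result is ASCII-safe but not a unicode object. Below is a minimal sketch of a variant that always returns text. The helper name ascii_only is hypothetical, and bytearray is used only so that the snippet behaves the same on Python 2.7 and Python 3.

```python
def ascii_only(data):
    """Drop every non-ASCII byte from a byte string and return text."""
    # Iterating a bytearray yields integers on both Python 2.7 and Python 3.
    kept = bytearray(b for b in bytearray(data) if b < 128)
    # Every remaining byte is below 128, so this decode cannot fail.
    return kept.decode('ascii')
```

For example, ascii_only(b'caf\xc3\xa9') gives u'caf': the two bytes of the UTF-8 encoded accented letter are discarded, which is the same loss the original filter accepts.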

"Best" depends on your application. You could incrementally improve your function:

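    def to_unicode_with_ascii_backoff(text):
        u = UnicodeDammit(text).unicode_markup
        # Use Dammit's result if it is non-empty (or the input itself was empty);
        # otherwise decode as ASCII, replacing undecodable bytes with U+FFFD.
        return u if u or not text else text.decode('ascii', 'replace')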

It returns a Unicode string, or raises an error if the input is neither a bytestring nor a Unicode string.
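To illustrate what the ASCII fallback actually produces, here is a small sketch comparing the 'replace' error handler with 'ignore'. The sample bytes are made up for illustration. The snippet relies only on the built-in decode method of byte strings and should behave the same on Python 2.7 and Python 3.

```python
# Latin-1 encoded sample; the bytes 0xE9 and 0xFF are not valid ASCII.
raw = b'caf\xe9 \xff menu'

# 'replace' substitutes U+FFFD for each undecodable byte,
# so the positions of the lost characters stay visible.
replaced = raw.decode('ascii', 'replace')

# 'ignore' drops the undecodable bytes silently,
# much like filtering on ord(i) < 128.
ignored = raw.decode('ascii', 'ignore')

print(repr(replaced))
print(repr(ignored))
```

With 'replace' the output has the same length as the input, which can help when later code needs to report where decoding failed. With 'ignore' the text is shorter and gives no hint that anything was removed.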

