Guaranteed-to-be-unicode or ASCII backoff in Python 2.7
I'm a big fan of the UnicodeDammit module in BeautifulSoup4, which puts a string firmly into unicode, with HTML unescaping:

    from bs4 import UnicodeDammit
    unicode_page = UnicodeDammit(raw_page, [suspected_encodings_if_any]).unicode_markup

There are cases where the mighty Dammit fails, though, and returns an empty string. I want to have some kind of backoff to ASCII for those cases.
Dammit uses chardet, so there is no point in backing off to that. (Dammit also looks for the iconv_codec library. Does anyone have experience with it?) What's the best way of backing off to ASCII? I'm trying this, which loses some things but seems to work:

    def to_unicode_with_ascii_backoff(text):
        if isinstance(text, unicode):
            return text
        else:
            ud = UnicodeDammit(text).unicode_markup
            if ud:
                return ud
            else:
                # Fallback: keep only the 7-bit ASCII bytes.
                return ''.join(i for i in text if ord(i) < 128)
"Best" depends on your application. You could incrementally improve your function:

    def to_unicode_with_ascii_backoff(text):
        u = UnicodeDammit(text).unicode_markup
        return u if u or not text else text.decode('ascii', 'replace')

It returns a unicode string, or raises an error if the input is neither a bytestring nor a unicode string.
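To see what the 'replace' error handler in that fallback does, compared with simply dropping the non-ASCII bytes, here is a minimal sketch using only the standard library. The sample byte string is made up for illustration and is not from the question; UnicodeDammit is left out so the snippet runs without bs4.

```python
# Hypothetical sample input: the UTF-8 bytes for "cafe au lait" with an
# accented e (two non-ASCII bytes, 0xC3 0xA9).
raw = b'caf\xc3\xa9 au lait'

# 'replace' puts U+FFFD in place of every byte that is not valid ASCII,
# so the loss stays visible in the result.
replaced = raw.decode('ascii', 'replace')

# 'ignore' silently drops those bytes, which loses the same data as
# filtering on ord(i) < 128 but yields a unicode string.
ignored = raw.decode('ascii', 'ignore')

print(repr(replaced))
print(repr(ignored))
```

Which one counts as "best" again depends on the application: 'replace' keeps a marker where data was lost, while 'ignore' gives cleaner-looking text at the cost of hiding the damage.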