utf-8 - Unwanted replacement of HTML entities by BeautifulSoup


I have some HTML containing MML that I'm generating from Word documents using MathType. I have a Python script that uses BeautifulSoup to prettify it. The problem is that it takes the entity for ∠ and turns it into the actual byte sequence 0xE2 0x88 0xA0 for the ∠ symbol. That's a problem because 0xE2 0x88 0xA0 won't display as ∠ in the browser. Instead, the browser interprets it as a series of Latin characters. This is happening with other math entities as well, such as δ ∠ − +... etc.
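The byte-level effect described above can be reproduced with the standard library alone. This is only a sketch: the entity name `&ang;` and the Windows-1252 fallback are assumptions, since the post does not say which entity form the document used or which encoding the browser guessed.

```python
import html

# Assumption: the entity in the source document is the named form "&ang;".
angle = html.unescape("&ang;")        # the single character U+2220

# This is the UTF-8 byte sequence that ends up in the output file.
utf8_bytes = angle.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))

# Assumption: with no charset declared, the browser decodes the bytes as
# Windows-1252, so each byte becomes a separate Latin-range character.
misread = utf8_bytes.decode("cp1252")
print(repr(misread))
```

The three bytes are a correct encoding of ∠; they only turn into stray Latin characters when they are decoded with a single-byte encoding.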

I looked through the BeautifulSoup documentation and can see how to turn entities into byte sequences, but I'm not using that command; I'm using prettify(). And I didn't see a way in the BeautifulSoup documentation to not turn entities into byte sequences.

Does anyone know if there's a setting in BeautifulSoup to tell it not to change entities into byte sequences? I hope so, because it seems kind of dumb to have to undo the damage after prettify runs :)

Thanks in advance for any help!

I missed part of the BeautifulSoup documentation. The default output formatter produces the described behaviour: HTML entities come out as Unicode characters. So the behaviour can be changed by using a different output formatter. (D'oh)

"You can change this behavior by providing a value for the formatter argument to prettify(), encode(), or decode()...."

So if you pass in formatter="html", Beautiful Soup will convert Unicode characters to HTML entities whenever possible! Yay! Thank you, Beautiful Soup!
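A minimal sketch of the difference, assuming the `bs4` package is installed and using the built-in `html.parser`. The sample markup is made up for illustration and is not the MathType output from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real input would be the MathType HTML.
markup = "<p>&ang;ABC &minus; &Delta;x</p>"
soup = BeautifulSoup(markup, "html.parser")

# Default formatter: parsed entities are written out as Unicode characters.
default_output = soup.prettify()

# formatter="html": Unicode characters are converted back to named
# HTML entities wherever Beautiful Soup knows a name for them.
entity_output = soup.prettify(formatter="html")

print(default_output)
print(entity_output)
```

The same `formatter="html"` argument can be passed to `encode()` and `decode()`, as the quoted documentation says.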

(And it has such great documentation. Pity I didn't read the whole thing sooner. :$)

