python - Regular expression to extract and exclude data from a string


I have an HTML string that I want to extract data out of.

s = "<ul><li>this bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong> bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"

I want to extract the content of the <li> elements that hold text such as "this bullet lev 1 " between the tags, and skip those that contain other <li> elements nested inside them, such as:

<li><ul><li><strong>&nbsp;this</strong> bullet lev&nbsp;</li></ul></li> 

I have written a regular expression for that:

<li>([\w &;/<>]*?)</li> 

However, it ends up pulling unwanted data as well:

<li>this bullet lev 1&nbsp;</li> <li><ul><li><strong>&nbsp;this</strong> bullet lev&nbsp;</li> <li>&nbsp;<ul><li><ul><li>this bullet lev 3</li> 

while I want to pull:

<li>this bullet lev 1&nbsp;</li> <li><strong>&nbsp;this</strong> bullet lev&nbsp;</li> <li>&nbsp;<ul><li><ul><li>this bullet lev 3</li> 

The idea is that I want to exclude results that have <li> inside the extracted data, and move ahead.

From my research I understood that I have to use a lookahead or lookbehind. I gave it a couple of tries, to no avail.

Any clues? I am using Python and the built-in re module.

I think this might do the job:

<li>((?!<li>).)*?</li> 

This should match <li> followed by </li>, with anything in between as long as it doesn't contain <li> (using a negative lookahead).
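As a minimal sketch of how that pattern behaves on the string from the question, the snippet below runs it with Python's re module. Two details are my own choices rather than part of the answer: the inner group is written as non-capturing, `(?:...)`, and `finditer` with `group(0)` is used to read the whole match, because a repeated capturing group only keeps its last character. Note that `.` does not match newlines unless `re.DOTALL` is passed.

```python
import re

# The sample string from the question, split across lines for readability.
s = ("<ul><li>this bullet lev 1&nbsp;</li>"
     "<li><ul><li><strong>&nbsp;this</strong> bullet lev&nbsp;</li></ul></li>"
     "<li>&nbsp;<ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li>"
     "</ul></ul><strong></li>")

# Each character consumed after the opening <li> is first checked by the
# negative lookahead (?!<li>), so a match can never span a nested <li>.
pattern = re.compile(r"<li>(?:(?!<li>).)*?</li>")

matches = [m.group(0) for m in pattern.finditer(s)]
for match in matches:
    print(match)
# <li>this bullet lev 1&nbsp;</li>
# <li><strong>&nbsp;this</strong> bullet lev&nbsp;</li>
# <li>this bullet lev 3</li>
```

Whenever the regex hits a nested <li>, the attempt at the outer <li> fails and the search resumes at the inner one, which is why only the innermost items come back.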

This assumes you don't want <li>&nbsp;<ul><li><ul><li>this bullet lev 3</li>, but rather <li>this bullet lev 3</li>, as in your other examples, which seems more consistent with your description.

That said, a parser is a better idea for this sort of thing, generally speaking.
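For illustration, here is one way the parser route could look using only the standard library's html.parser. The class name `LeafLiCollector` and the helper `leaf_li_texts` are hypothetical names made up for this sketch, and it returns the text of each <li> that has no nested <li> rather than the raw markup, which is an assumption about what is ultimately wanted.

```python
from html.parser import HTMLParser


class LeafLiCollector(HTMLParser):
    """Collect the text of every <li> that contains no nested <li>."""

    def __init__(self):
        super().__init__()   # convert_charrefs=True turns &nbsp; into "\xa0"
        self.stack = []      # one frame per currently open <li>
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.stack:
                # The enclosing <li> now has a nested <li>, so it is not a leaf.
                self.stack[-1]["nested"] = True
            self.stack.append({"nested": False, "parts": []})

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["parts"].append(data)

    def handle_endtag(self, tag):
        # A stray </li> with no open <li> is simply ignored.
        if tag == "li" and self.stack:
            frame = self.stack.pop()
            if not frame["nested"]:
                text = "".join(frame["parts"]).replace("\xa0", " ").strip()
                self.results.append(text)


def leaf_li_texts(html):
    parser = LeafLiCollector()
    parser.feed(html)
    parser.close()
    return parser.results


s = ("<ul><li>this bullet lev 1&nbsp;</li>"
     "<li><ul><li><strong>&nbsp;this</strong> bullet lev&nbsp;</li></ul></li>"
     "<li>&nbsp;<ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li>"
     "</ul></ul><strong></li>")

print(leaf_li_texts(s))
# ['this bullet lev 1', 'this bullet lev', 'this bullet lev 3']
```

Unlike the regex, this does not depend on the exact spelling of the tags (attributes, upper case, line breaks), and it tolerates the unbalanced closing tags at the end of the sample string.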

