python - Regular expression to extract and exclude data from a string
I have an HTML string that I want to extract data out of.
s = "<ul><li>this bullet lev 1 </li><li><ul><li><strong> this</strong> bullet lev </li></ul></li><li> <ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"
I want to extract the content of the <li> elements that have only data between them, such as "this bullet lev 1 ", and not those that contain other <li> elements, as in multilevel elements such as:
<li><ul><li><strong> this</strong> bullet lev </li></ul></li>
I have written this regular expression:
<li>([\w &;/<>]*?)</li>
However, it ends up pulling unwanted data as well:
<li>this bullet lev 1 </li> <li><ul><li><strong> this</strong> bullet lev </li> <li> <ul><li><ul><li>this bullet lev 3</li>
while I want it to pull:
<li>this bullet lev 1 </li> <li><strong> this</strong> bullet lev </li> <li> <ul><li><ul><li>this bullet lev 3</li>
The idea is that I want to exclude results that have <li> in the extracted data, and move ahead.
From my research I understood that I have to use a lookahead or lookbehind, but I gave it a couple of tries to no avail.
Any clues? I am using Python and the built-in re module.
I think this might do the job:
<li>((?!<li>).)*?</li>
It should match <li> followed by </li>, with anything in between as long as it doesn't contain <li> (using a negative lookahead).
This assumes you don't want <li> <ul><li><ul><li>this bullet lev 3</li>, but rather <li>this bullet lev 3</li>, which, given your examples, seems more consistent with your description.
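As a quick check, here is a minimal sketch that runs both patterns against the string from the question (joined onto one line, which is an assumption about how it was meant to be written). The variable names are only illustrative. It reads the whole match with group(0), because the capture group in the lookahead pattern wraps a single character, so re.findall would return only the last character of each match.

```python
import re

# The string from the question, written as one literal (assumed layout).
s = ("<ul><li>this bullet lev 1 </li><li><ul><li><strong> this</strong> bullet lev </li></ul></li>"
     "<li> <ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>")

# Pattern from the question: the character class allows '<' and '>',
# so a match can run straight through nested <li> tags.
original = re.compile(r"<li>([\w &;/<>]*?)</li>")

# Pattern from the answer: the negative lookahead refuses to step
# onto any position where another "<li>" begins.
tempered = re.compile(r"<li>((?!<li>).)*?</li>")

# group(0) is the full match, independent of what the capture group holds.
original_matches = [m.group(0) for m in original.finditer(s)]
tempered_matches = [m.group(0) for m in tempered.finditer(s)]

for match in tempered_matches:
    print(match)
```

With this input the lookahead pattern yields only the innermost items, while the original pattern still returns the nested openers in its second and third matches. Note that "." does not match newlines by default, so multi-line input would need the re.DOTALL flag.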
That said, a parser is a better idea for this sort of thing, generally speaking.
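For what the parser route could look like, here is a rough sketch using html.parser from the standard library. The class name LeafListItems and its attributes are made up for illustration, and it assumes you only want the text of each <li> that has no nested <li>, so inner markup such as <strong> is dropped.

```python
from html.parser import HTMLParser


class LeafListItems(HTMLParser):
    """Collect the text of every <li> that has no nested <li> inside it."""

    def __init__(self):
        super().__init__()
        # One [has_nested_li, text_chunks] entry per currently open <li>.
        self.open_items = []
        self.leaves = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.open_items:
                # The enclosing <li> now has a nested item, so it is not a leaf.
                self.open_items[-1][0] = True
            self.open_items.append([False, []])

    def handle_data(self, data):
        # Text belongs to the innermost open <li>, if there is one.
        if self.open_items:
            self.open_items[-1][1].append(data)

    def handle_endtag(self, tag):
        # Stray </li> tags with no open item are ignored.
        if tag == "li" and self.open_items:
            has_nested, chunks = self.open_items.pop()
            if not has_nested:
                self.leaves.append("".join(chunks))


# The string from the question, written as one literal (assumed layout).
s = ("<ul><li>this bullet lev 1 </li><li><ul><li><strong> this</strong> bullet lev </li></ul></li>"
     "<li> <ul><li><ul><li>this bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>")

parser = LeafListItems()
parser.feed(s)
parser.close()
leaves = parser.leaves
print(leaves)
```

The parser tolerates the unbalanced closing tags at the end of the string, which a regular expression has no notion of.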