.NET Regex split with optional items and greediness
Here is a test sample of the data I'm trying to parse:
content1
***
content2
***
content3
content3
***

content4
***
content5

***
***
content6
***
***
***
content7
In short, the data entries are separated by a separator string on a line of its own. I want to get content1, content2, content3\r\ncontent3, \r\ncontent4, content5\r\n, content6, content7. I tried a simple regex, (\r\n)?^\*\*\*$(\r\n)?, which in RegexBuddy's split mode, with the '^$ match at line breaks' option and the .NET flavour selected, gives me the correct output, albeit with empty strings between multiple separators, which I understand I have to remove manually. However, the following C# code:
string regexPattern = string.Format(@"(\r\n)?^{0}$(\r\n)?", Regex.Escape("***"));
var records = Regex.Split(document, regexPattern, RegexOptions.Multiline);
returns the whole input string. What is wrong here?
Also, I don't understand why the regex (\r\n)?\*\*\*(\r\n)? (which doesn't suit me, because an entry can include the separator string) returns content1, \r\n, \r\n, content2, \r\n, etc. Shouldn't the \r\n items be greedily grabbed? RegexBuddy gives me the correct output for this regex.
1. What does $ match?
The main problem is that $ matches right before the \n character. If there is a \r before the \n, you need to specify it in the pattern; otherwise, no match is returned. You can test this by splitting the string "content1\r\ncontent2\r\ncontent3" on the pattern "\r$" to see the effect.
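The test suggested above can be sketched roughly as follows. This is only an illustration, written as C# top-level statements; the variable names are made up for the example and are not from the original question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative input only: three lines separated by Windows-style "\r\n".
string input = "content1\r\ncontent2\r\ncontent3";

// With RegexOptions.Multiline, "$" matches right before each "\n"
// and at the very end of the string. The "\r" is not treated as a line break.
int[] endPositions = Regex.Matches(input, "$", RegexOptions.Multiline)
                          .Cast<Match>()
                          .Select(m => m.Index)
                          .ToArray();
Console.WriteLine(string.Join(", ", endPositions)); // 9, 19, 28

// Splitting on "\r$" consumes only the "\r", so each "\n" is left
// at the start of the following piece.
string[] parts = Regex.Split(input, @"\r$", RegexOptions.Multiline);
foreach (string part in parts)
    Console.WriteLine(part.Replace("\r", "\\r").Replace("\n", "\\n"));
// content1
// \ncontent2
// \ncontent3
```

The positions 9 and 19 are the indices of the two \n characters, which shows that each \r sits before the point where $ matches.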
Here is an example to show what is considered the beginning of a line (^) and the end of a line ($) under the Multiline option. The first line is the original string (with new line and carriage return shown as \n and \r respectively), and the second line is annotated with the positions of the zero-length strings matched by ^ and $.
***\r\nconte\rn\rt3\r\nco\nntent3
^***\r$\n^conte\rn\rt3\r$\n^co$\n^ntent3$
To resolve the problem, you need to test for an (optional) \r right before $. In the solution in section 3, I test for an optional \r, since the \r may not be there if the input file comes from a Unix environment.
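As a small sketch of why the \r has to be optional, the snippet below checks the separator pattern against a CRLF input and an LF input. Both input strings are hypothetical examples, not taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

string windowsText = "a\r\n***\r\nb"; // hypothetical CRLF input
string unixText = "a\n***\nb";        // hypothetical LF input

// Without "\r?", "$" cannot match after "***" in the CRLF input,
// because the next character is "\r" rather than "\n".
bool strictOnWindows = Regex.IsMatch(windowsText, @"^\*\*\*$", RegexOptions.Multiline);

// With "\r?" before "$", the same pattern works for both line-ending styles.
bool tolerantOnWindows = Regex.IsMatch(windowsText, @"^\*\*\*\r?$", RegexOptions.Multiline);
bool tolerantOnUnix = Regex.IsMatch(unixText, @"^\*\*\*\r?$", RegexOptions.Multiline);

Console.WriteLine(strictOnWindows);   // False
Console.WriteLine(tolerantOnWindows); // True
Console.WriteLine(tolerantOnUnix);    // True
```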
2. Inclusion of captured text in the result of Regex.Split
From .NET Framework 2.0 onwards, Regex.Split splits the string at the delimiter and also includes the captured text in the result array.
To resolve the above problem, you need to turn the capturing groups (capture text + grouping property) into non-capturing groups (?:pattern) (grouping property only).
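The difference between the two kinds of group can be seen in a short sketch like the one below. It uses a made-up two-entry input and the second regex from the question, without the ^ and $ anchors.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical two-entry input with a single separator line.
string input = "content1\r\n***\r\ncontent2";

// Capturing groups: the text captured by each (\r\n) is added to the result.
string[] withCaptures = Regex.Split(input, @"(\r\n)?\*\*\*(\r\n)?");
Console.WriteLine(withCaptures.Length); // 4: "content1", "\r\n", "\r\n", "content2"

// Non-capturing groups: only the pieces between the delimiters are returned.
string[] withoutCaptures = Regex.Split(input, @"(?:\r\n)?\*\*\*(?:\r\n)?");
Console.WriteLine(withoutCaptures.Length); // 2: "content1", "content2"
```

So the \r\n sequences are in fact grabbed greedily by the match. They reappear in the output only because Regex.Split appends the captured groups.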
3. Conclusion
So the solution is:
var records = Regex.Split(document, @"(?:\r?\n)?^[*]{3}\r?$\n?", RegexOptions.Multiline);
You can deal with the empty strings in the result separately.
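One possible way to put this together, including dropping the empty strings, is sketched below. The document string is an assumption: it reproduces the sample from the question with \r\n line endings, and the LINQ filter is just one option for removing the empty entries.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed input: the sample data from the question, with "\r\n" line endings.
string document =
    "content1\r\n***\r\ncontent2\r\n***\r\ncontent3\r\ncontent3\r\n***\r\n" +
    "\r\ncontent4\r\n***\r\ncontent5\r\n\r\n***\r\n***\r\ncontent6\r\n" +
    "***\r\n***\r\n***\r\ncontent7";

// Split on a separator line, consuming one line break on each side of it.
string[] records = Regex.Split(document, @"(?:\r?\n)?^[*]{3}\r?$\n?", RegexOptions.Multiline);

// Consecutive separator lines produce empty strings; filter them out.
string[] nonEmpty = records.Where(r => r.Length > 0).ToArray();

Console.WriteLine(records.Length);  // 10 (three of them are empty)
Console.WriteLine(nonEmpty.Length); // 7
```

Filtering on Length keeps entries such as "\r\ncontent4" and "content5\r\n", which contain only part of a blank line but are not empty.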