.NET Regex split with optional items and greediness
Here is a test sample of the data I'm trying to parse:
content1
***
content2
***
content3
content3
***

content4
***
content5

***
***
content6
***
***
***
content7
In short, the data entries are separated by a separator string on its own line. I want to get content1, content2, content3\r\ncontent3, \r\ncontent4, content5\r\n, content6, content7. I tried a simple regex, (\r\n)?^\*\*\*$(\r\n)?, which in RegexBuddy's Split mode, with the '^$ match at line breaks' option and the .NET flavour selected, gives me the correct output, albeit with empty strings between multiple separators, which I understand I have to remove manually. However, the following C# code:
string regexPattern = string.Format(@"(\r\n)?^{0}$(\r\n)?", Regex.Escape("***"));
var records = Regex.Split(document, regexPattern, RegexOptions.Multiline);
returns the whole input string. What is wrong here?
Also, I don't understand why the regex (\r\n)?\*\*\*(\r\n)? (which doesn't suit me, because an entry can include the separator string) returns content1, \r\n, \r\n, content2, \r\n, etc. Shouldn't the \r\n items be greedily grabbed? RegexBuddy gives me the correct output for this regex.
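1. What does $ match?
The main problem is that $ matches only right before a \n character (or at the very end of the string), so if there is a \r before the \n, you need to specify it in the pattern. Otherwise, no match is returned. In the question's pattern, ^\*\*\*$ never matches, because a \r sits between *** and the \n, so Regex.Split returns the whole input. You can test this by splitting the string "content1\r\ncontent2\r\ncontent3" on "\r$" (with the Multiline option) to see the effect.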
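The effect described above can be reproduced with a small sketch like the following. The class and method names (DollarAnchorDemo, SplitOnCrBeforeEol, LineMatches, Show) are made up for illustration; the input is the test string from the paragraph above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DollarAnchorDemo
{
    // Test string from the text above, with Windows-style line endings.
    public const string Input = "content1\r\ncontent2\r\ncontent3";

    // Split on a \r that sits directly before an end-of-line position.
    public static string[] SplitOnCrBeforeEol()
    {
        return Regex.Split(Input, @"\r$", RegexOptions.Multiline);
    }

    // Try a line-anchored pattern against the input under Multiline.
    public static bool LineMatches(string pattern)
    {
        return Regex.IsMatch(Input, pattern, RegexOptions.Multiline);
    }

    // Make control characters visible when printing.
    public static string Show(string s)
    {
        return s.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        // Prints content1, then \ncontent2, then \ncontent3:
        // the \r was consumed as the delimiter, the \n stays with the next item.
        foreach (string part in SplitOnCrBeforeEol())
            Console.WriteLine(Show(part));

        Console.WriteLine(LineMatches(@"^content2$"));     // False: the \r is in the way
        Console.WriteLine(LineMatches(@"^content2\r?$"));  // True
    }
}
```

The split succeeds only because the pattern names the \r explicitly; a pattern that expects $ to come directly after the line's text fails on \r\n input.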
Here is an example to show what is considered the beginning of a line (^) and the end of a line ($) under the Multiline option. The first line is the original string (with the new line and carriage return shown as \n and \r respectively), and the second line is annotated with the positions of the zero-length strings matched by ^ and $.
***\r\nconte\rn\rt3\r\nco\nntent3
^***\r$\n^conte\rn\rt3\r$\n^co$\n^ntent3$
To resolve the problem, you need to test for an (optional) \r right before the $. In the solution in section 3, I test for an optional \r, since the \r may not be there if the input file comes from a Unix environment.
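The annotated positions can be checked with a short sketch such as this one. AnchorPositions and PositionsOf are hypothetical names; the input is the example string above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnchorPositions
{
    // Same string as in the annotated example above.
    public const string Input = "***\r\nconte\rn\rt3\r\nco\nntent3";

    // Indexes of every zero-length match of an anchor under Multiline.
    public static int[] PositionsOf(string anchor)
    {
        return Regex.Matches(Input, anchor, RegexOptions.Multiline)
                    .Cast<Match>()
                    .Select(m => m.Index)
                    .ToArray();
    }

    public static void Main()
    {
        // ^ matches at the start and right after each \n.
        Console.WriteLine(string.Join(", ", PositionsOf("^")));  // 0, 5, 17, 20
        // $ matches right before each \n and at the end; a lone \r is ignored.
        Console.WriteLine(string.Join(", ", PositionsOf("$")));  // 4, 16, 19, 26
    }
}
```

Index 4 is the position between \r and \n on the first line, which is why the \r has to appear in the pattern before the $.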
2. Inclusion of captured text in the result of Regex.Split
Starting with .NET Framework 2.0, Regex.Split splits the string at the delimiter and also includes the captured text in the result array. This is why the second regex in the question returns the extra \r\n items.
To resolve the above problem, you need to turn the capturing groups (capture text + grouping property) into non-capturing groups (?:pattern) (grouping property only).
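As a rough illustration of the difference, the sketch below splits a shortened version of the question's data with both kinds of group. SplitCaptureDemo and SplitWith are names chosen for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureDemo
{
    // Two records with one separator line between them.
    public const string Input = "content1\r\n***\r\ncontent2";

    public static string[] SplitWith(string pattern)
    {
        return Regex.Split(Input, pattern);
    }

    public static void Main()
    {
        // Capturing groups: both captured \r\n strings are added to the array,
        // giving content1, \r\n, \r\n, content2.
        Console.WriteLine(SplitWith(@"(\r\n)?\*\*\*(\r\n)?").Length);      // 4

        // Non-capturing groups: the \r\n is still consumed, but not returned.
        Console.WriteLine(SplitWith(@"(?:\r\n)?\*\*\*(?:\r\n)?").Length);  // 2
    }
}
```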
3. Conclusion
So the solution is:
var records = Regex.Split(document, @"(?:\r?\n)?^[*]{3}\r?$\n?", RegexOptions.Multiline);
You can deal with the empty strings in the result separately.
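Put together, a minimal sketch of the whole thing might look like this. It assumes the sample data from the question joined with \r\n, and it filters the empty strings with LINQ, which is just one possible way to handle them; RecordSplitter, Sample and SplitRecords are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RecordSplitter
{
    // Pattern from the answer above: optional line break, separator line, optional \r, line end.
    private const string Separator = @"(?:\r?\n)?^[*]{3}\r?$\n?";

    // Sample data from the question, one array element per line.
    public static readonly string Sample = string.Join("\r\n", new[]
    {
        "content1", "***", "content2", "***", "content3", "content3", "***",
        "", "content4", "***", "content5", "", "***", "***", "content6",
        "***", "***", "***", "content7"
    });

    // Split on separator lines and drop the empty strings left between adjacent separators.
    public static string[] SplitRecords(string document)
    {
        return Regex.Split(document, Separator, RegexOptions.Multiline)
                    .Where(record => record.Length > 0)
                    .ToArray();
    }

    public static void Main()
    {
        foreach (string record in SplitRecords(Sample))
        {
            // Show line breaks as escapes so they are visible in the output.
            Console.WriteLine("[" + record.Replace("\r", "\\r").Replace("\n", "\\n") + "]");
        }
    }
}
```

Because only zero-length strings are filtered out, entries such as \r\ncontent4 and content5\r\n keep their blank lines. The same pattern also handles \n-only input, since every \r in it is optional.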