.NET Regex split with optional items and greediness


Here is a test sample of the data I'm trying to parse:

    content1
    ***
    content2
    ***
    content3
    content3
    ***

    content4
    ***
    content5

    ***
    ***
    content6
    ***
    ***
    ***
    content7

In short, the data entries are separated by a separator string on its own line. I want to get content1, content2, content3\r\ncontent3, \r\ncontent4, content5\r\n, content6, content7. I tried the simple regex (\r\n)?^\*\*\*$(\r\n)?, which in RegexBuddy's split mode, with the '^$ match at line breaks' option and the .NET flavour selected, gives me the correct output, albeit with empty strings between multiple separators, which I understand I have to remove manually. However, the following C# code:

    string regexPattern = string.Format(@"(\r\n)?^{0}$(\r\n)?", Regex.Escape("***"));
    var records = Regex.Split(document, regexPattern, RegexOptions.Multiline);

returns the whole input string. What is wrong here?

Also, I don't understand why the regex (\r\n)?\*\*\*(\r\n)? (which doesn't suit me, because an entry can include the separator string) returns content1, \r\n, \r\n, content2, \r\n, etc. Shouldn't the \r\n items be greedily grabbed? RegexBuddy gives me the correct output for this regex.

1. What does $ match?

The main problem is that $ matches right before the \n character, so if there is a \r before the \n, you need to specify it in the pattern. Otherwise, no match is returned. You can test this by splitting the string "content1\r\ncontent2\r\ncontent3" on "\r$" (with the Multiline option) to see the effect.
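The experiment described above can be sketched roughly as follows. The class and helper names (DollarDemo, SplitOnCrBeforeEol, Show) are made up for illustration; only Regex.Split, Regex.IsMatch and RegexOptions.Multiline come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DollarDemo
{
    // Hypothetical helper: splits on a \r that sits directly before an end-of-line position.
    public static string[] SplitOnCrBeforeEol(string input)
    {
        return Regex.Split(input, @"\r$", RegexOptions.Multiline);
    }

    // Hypothetical helper: makes line-break characters visible when printing.
    public static string Show(string s)
    {
        return s.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    public static void Main()
    {
        string input = "content1\r\ncontent2\r\ncontent3";

        // The \r before each \n is consumed by the split; the \n stays with the next piece.
        foreach (string part in SplitOnCrBeforeEol(input))
        {
            Console.WriteLine("[" + Show(part) + "]");
        }

        // Without \r in the pattern, $ cannot match right after "content1".
        Console.WriteLine(Regex.IsMatch(input, @"content1$", RegexOptions.Multiline));
        // Allowing an optional \r before $ makes the match succeed.
        Console.WriteLine(Regex.IsMatch(input, @"content1\r?$", RegexOptions.Multiline));
    }
}
```

The pieces should come out as content1, \ncontent2 and \ncontent3, and the two IsMatch calls should print False and True, which is the same reason the original pattern never matched on input with \r\n line endings.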

Here is an example to show what is considered the beginning of a line ^ and the end of a line $ (under the Multiline option). The first line is the original string (with new line and carriage return shown as \n and \r respectively), and the second line is annotated with the positions of the zero-length strings matched by ^ and $.

    ***\r\nconte\rn\rt3\r\nco\nntent3
    ^***\r$\n^conte\rn\rt3\r$\n^co$\n^ntent3$

To resolve the problem, you need to test for an (optional) \r right before $. In the solution below in section 3, I test for an optional \r, since the \r may not be there if the input file comes from a Unix environment.

2. Inclusion of captured text in the result of Regex.Split

Starting with .NET Framework 2.0, Regex.Split splits the string at the delimiter and also includes the captured text in the result array.

To resolve the above problem, you need to turn the capturing groups (capture text + grouping property) into non-capturing groups (?:pattern) (grouping property only).
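A small sketch of the difference, using the second regex from the question on a shortened input. The class and method names (CaptureDemo, SplitCapturing, SplitNonCapturing) are hypothetical and only there to keep the example self-contained.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Capturing groups: the text captured by each group is added to the result array.
    public static string[] SplitCapturing(string input)
    {
        return Regex.Split(input, @"(\r\n)?\*\*\*(\r\n)?");
    }

    // Non-capturing groups: same grouping, but nothing extra ends up in the result.
    public static string[] SplitNonCapturing(string input)
    {
        return Regex.Split(input, @"(?:\r\n)?\*\*\*(?:\r\n)?");
    }

    public static void Main()
    {
        string input = "content1\r\n***\r\ncontent2";

        // Expected to contain 4 items: content1, \r\n, \r\n, content2
        Console.WriteLine(SplitCapturing(input).Length);

        // Expected to contain 2 items: content1, content2
        Console.WriteLine(SplitNonCapturing(input).Length);
    }
}
```

So the \r\n items are in fact grabbed greedily by the match; they reappear in the output only because they were captured.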

3. Conclusion

So the solution is:

    var records = Regex.Split(document, @"(?:\r?\n)?^[*]{3}\r?$\n?", RegexOptions.Multiline);


You can deal with the empty strings in the result separately.
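One possible way to do that, sketched with LINQ. The SplitRecordsDemo class and SplitRecords method are hypothetical names, and filtering with Where is just one option; the sample document is the one from the question, assuming \r\n line endings.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SplitRecordsDemo
{
    // Hypothetical helper: splits on separator lines and drops the empty
    // strings produced by consecutive separators.
    public static string[] SplitRecords(string document)
    {
        return Regex
            .Split(document, @"(?:\r?\n)?^[*]{3}\r?$\n?", RegexOptions.Multiline)
            .Where(record => record.Length > 0)
            .ToArray();
    }

    public static void Main()
    {
        string document =
            "content1\r\n***\r\ncontent2\r\n***\r\ncontent3\r\ncontent3\r\n***\r\n" +
            "\r\ncontent4\r\n***\r\ncontent5\r\n\r\n***\r\n***\r\ncontent6\r\n" +
            "***\r\n***\r\n***\r\ncontent7";

        foreach (string record in SplitRecords(document))
        {
            // Show line breaks explicitly so entries such as "\r\ncontent4" are visible.
            Console.WriteLine("[" + record.Replace("\r", "\\r").Replace("\n", "\\n") + "]");
        }
    }
}
```

For the sample data this should leave the seven entries listed in the question. Note that the filter only drops completely empty strings, so entries such as \r\ncontent4 and content5\r\n keep their blank lines.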

