regex - Java: Regexp on user-defined tokens
I need to run a regexp on a set of user-defined tokens.
For example, if I have a string like this:
tok3 tok1 tok2 tok2 tok4 tok3 // example string
and I use a regexp like this:
(tok1|tok2)+ // regexp
I'd like to capture the sequence of tokens:
tok1 tok2 tok2
in the example string.
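Now, regexps work on sequences of characters, but my problem is different in the sense that my tokens are not characters but strings. Tokens are composed of 2 or more characters. Furthermore, the software should be able to detect that the regexp in the example matches the string at position (1, 4), that is, tokens 1 through 3 when counting tokens from 0.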
For the moment, I have solved the problem by mapping each token to a char in the ASCII alphabet and running the regexp after removing the spaces.
However, I'm not satisfied with this solution and I'm wondering if there is a better one. Thanks!
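For reference, the mapping workaround described above might look roughly like the sketch below. It is not the asker's actual code: the class and method names (TokenRegex, findTokenMatches) are made up for illustration, it assumes tokens are separated by whitespace and that the regexp contains only token names and regex operators, and it uses private-use Unicode characters starting at U+E000 instead of ASCII so that no placeholder collides with a regex metacharacter. Because each token becomes exactly one char, match offsets in the encoded string are already token positions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TokenRegex {

    /** Returns one {start, end} pair of token indices (end exclusive) per match. */
    static List<int[]> findTokenMatches(String tokenRegex, String input) {
        // Assign one private-use character to each distinct token in the input.
        Map<String, Character> codes = new LinkedHashMap<>();
        StringBuilder encoded = new StringBuilder();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) continue; // empty input yields one empty piece
            Character code = codes.get(token);
            if (code == null) {
                code = (char) ('\uE000' + codes.size());
                codes.put(token, code);
            }
            encoded.append(code.charValue());
        }

        // Rewrite token names inside the regexp, longest names first so that
        // "tok10" is not rewritten as the code for "tok1" followed by "0".
        List<String> names = new ArrayList<>(codes.keySet());
        names.sort((a, b) -> b.length() - a.length());
        String charRegex = tokenRegex;
        for (String name : names) {
            charRegex = charRegex.replace(name, codes.get(name).toString());
        }

        // One char per token, so character offsets equal token offsets.
        List<int[]> matches = new ArrayList<>();
        Matcher m = Pattern.compile(charRegex).matcher(encoded);
        while (m.find()) {
            if (m.end() > m.start()) { // ignore empty matches
                matches.add(new int[] {m.start(), m.end()});
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        for (int[] span : findTokenMatches("(tok1|tok2)+", "tok3 tok1 tok2 tok2 tok4 tok3")) {
            System.out.println("start = " + span[0] + ", finish = " + span[1]);
        }
        // prints: start = 1, finish = 4
    }
}
```

Token names that appear in the regexp but not in the input are left untouched here; they can never match the encoded string, so the result is unaffected.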
Edit
The spaces in the regexp are only needed to separate the tokens. I don't mean that spaces are mandatory between tokens.
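If spaces between tokens really are optional, the input has to be cut into tokens some other way before positions can be counted. One possible sketch, assuming the full list of token names is known up front (the question does not say so), is to match the names greedily from left to right. VocabTokenizer and tokenize are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class VocabTokenizer {

    /** Splits input into known tokens; whitespace between tokens is optional. */
    static List<String> tokenize(String input, List<String> vocabulary) {
        if (vocabulary.isEmpty() || vocabulary.contains("")) {
            throw new IllegalArgumentException("vocabulary needs non-empty token names");
        }
        // Longest names first, so "tok10" is preferred over "tok1".
        List<String> names = new ArrayList<>(vocabulary);
        names.sort((a, b) -> b.length() - a.length());
        StringBuilder alternation = new StringBuilder();
        for (String name : names) {
            if (alternation.length() > 0) alternation.append('|');
            alternation.append(Pattern.quote(name));
        }
        Matcher m = Pattern.compile("\\s*(" + alternation + ")").matcher(input);

        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) break; // no known token starts here
            tokens.add(m.group(1));
            pos = m.end();
        }
        // Anything left other than trailing whitespace is not a known token.
        if (!input.substring(pos).trim().isEmpty()) {
            throw new IllegalArgumentException("Unknown token at index " + pos);
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("tok1", "tok2", "tok3", "tok4");
        List<String> tokens = tokenize("tok3tok1 tok2tok2  tok4 tok3", vocabulary);
        System.out.println(String.join(" ", tokens));
        // prints: tok3 tok1 tok2 tok2 tok4 tok3
    }
}
```

Greedy longest-first matching is a simplification: with overlapping names it can reject an input that a different split would accept. Once the tokens are in a list, an index into that list is the token position, and joining them with single spaces gives a normalized string for any approach that expects one space between tokens.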
How about storing the positions of the spaces and using them to translate a string position into a token position?
It's far from being as elegant as a straight regex, but it's an idea.
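    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Maps the character index of each space to the number of spaces seen so far,
    // i.e. to the index of the token that starts right after that space.
    TreeMap<Integer, Integer> spaces = new TreeMap<Integer, Integer>();
    String regex = "(?<=^| )((tok1|tok2)( |$))+";
    String str = "tok3 tok1 tok2 tok2 tok4 tok3";
    int c = 0;
    spaces.put(0, 0);
    for (int i = 0; i < str.length(); i++) {
        if (str.charAt(i) == ' ')
            spaces.put(i, ++c);
    }
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(str);
    while (m.find()) {
        System.out.println(m.group());
        // floorEntry finds the last space at or before the given character index.
        System.out.println("start = " + spaces.floorEntry(m.start()).getValue());
        // Note: if the match runs to the end of the string there is no trailing
        // space, so finish is the index of the last token rather than one past it.
        System.out.println("finish = " + spaces.floorEntry(m.end()).getValue());
    }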
Another option is to use String.split:
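    String str = "tok3 tok1 tok2 tok2 tok4 tok3";
    String[] arr = str.split(" "); // maybe consider using \\s or \\s+ instead
    int start = -1;
    String match = "";
    for (int i = 0; i < arr.length; i++) {
        if (arr[i].matches("(tok1|tok2)")) {
            // Token belongs to the run; remember where the run started.
            if (start == -1) start = i;
            match += ((match.length() != 0) ? " " : "") + arr[i];
        } else if (start != -1) {
            // First non-matching token after a run: report the run and reset.
            System.out.println(match);
            System.out.println("start = " + start);
            System.out.println("finish = " + i);
            match = "";
            start = -1;
        }
    }
    // Report a run that extends to the last token, which the loop alone would miss.
    if (start != -1) {
        System.out.println(match);
        System.out.println("start = " + start);
        System.out.println("finish = " + arr.length);
    }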