Description
The fn:tokenize function splits a string into a sequence of substrings, using a regular expression to identify the separators. The regular expression syntax is the one defined by XML Schema, with a few modifications and additions in XQuery 1.0 and XPath/XSLT 2.0. The $pattern argument is a regular expression that matches the separator. The simplest patterns are a single space or a string that contains just the separator character, such as a comma (,). However, certain characters have a special meaning in regular expressions and must be escaped with a backslash when used literally, namely .\?*+|^${}()[].
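As a brief illustration of escaping, the following sketch tokenizes on a literal period and a literal vertical bar. The input strings are made up for the example, and the results in the comments follow from the rules described here.

```xquery
(: A literal period must be escaped; an unescaped . matches any character :)
tokenize('www.example.com', '\.'),     (: ('www', 'example', 'com') :)

(: Inside a character class, the period is already literal :)
tokenize('www.example.com', '[.]'),    (: ('www', 'example', 'com') :)

(: The vertical bar is the alternation operator unless it is escaped :)
tokenize('a|b|c', '\|')                (: ('a', 'b', 'c') :)
```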
The separators are not included in the result strings. If two adjacent separators appear, a zero-length string is included in the result sequence. If the string starts with the separator, a zero-length string is the first value returned. Likewise, if the string ends with the separator, a zero-length string is the last value in the result sequence.
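When the zero-length strings are not wanted, one common approach (a convention of the caller, not a feature of fn:tokenize) is to filter them out with a predicate. A minimal sketch, using a made-up input string:

```xquery
(: Leading, doubled, and trailing separators each produce a zero-length string :)
let $tokens := tokenize(',a,,b,', ',')     (: ('', 'a', '', 'b', '') :)
return $tokens[. ne '']                    (: ('a', 'b') :)
```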
The $flags parameter allows for additional options in the interpretation of the regular expression. Flags and regular expressions are covered in detail in chapter 18 of the book XQuery.
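For instance, the i flag makes the pattern case-insensitive. A small sketch, using a made-up input string:

```xquery
(: Without flags, only the lowercase x is treated as a separator :)
tokenize('oneXtwoxthree', 'x'),         (: ('oneXtwo', 'three') :)

(: With the i flag, both X and x are separators :)
tokenize('oneXtwoxthree', 'x', 'i')     (: ('one', 'two', 'three') :)
```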
If a particular point in the string could match more than one alternative, the first alternative is chosen. This is exhibited in the last example below, with the pattern ',|,x', where the function considers the comma alone to be the separator, even though the second alternative, a comma followed by x, also applies.
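The order of the alternatives therefore matters. As a sketch of the effect, listing the longer alternative first makes it win, so the x is consumed as part of the separator:

```xquery
(: The comma alternative comes first, so the x stays in the tokens :)
tokenize('a,xb,xc', ',|,x'),    (: ('a', 'xb', 'xc') :)

(: ",x" comes first, so the x is consumed along with the comma :)
tokenize('a,xb,xc', ',x|,')     (: ('a', 'b', 'c') :)
```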
This description is © Copyright 2007, Priscilla Walmsley. It is excerpted from the book XQuery by Priscilla Walmsley, O'Reilly, 2007. For a complete explanation of this function, please refer to Appendix A of the book.

Arguments and Return Type

| Name | Type | Description |
|---|---|---|
| $input | xs:string? | the string to tokenize |
| $pattern | xs:string | regular expression to match the delimiters |
| $flags | xs:string | flags that control multiline mode, case insensitivity, etc. |
| return value | xs:string* | |

Examples

| XPath Example | Results |
|---|---|
| `tokenize('a b c', '\s')` | `('a', 'b', 'c')` |
| `tokenize('a   b c', '\s')` | `('a', '', '', 'b', 'c')` |
| `tokenize('a   b c', '\s+')` | `('a', 'b', 'c')` |
| `tokenize(' b c', '\s')` | `('', 'b', 'c')` |
| `tokenize('a,b,c', ',')` | `('a', 'b', 'c')` |
| `tokenize('a,b,,c', ',')` | `('a', 'b', '', 'c')` |
| `tokenize('a, b, c', '[,\s]+')` | `('a', 'b', 'c')` |
| `tokenize('2006-12-25T12:15:00', '[\-T:]')` | `('2006', '12', '25', '12', '15', '00')` |
| `tokenize('Hello, there.', '\W+')` | `('Hello', 'there', '')` |
| `tokenize((), '\s+')` | `()` |
| `tokenize('abc', '\s')` | `'abc'` |
| `tokenize('abcd', 'b?')` | Error FORX0003 |
| `tokenize('a,xb,xc', ',\|,x')` | `('a', 'xb', 'xc')` |
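
Because the result is an ordinary sequence of strings, it can be processed like any other sequence. A minimal sketch that reuses the date-time example from the table and casts each token to an integer:

```xquery
(: Split on hyphen, T, or colon, then cast each token to xs:integer :)
for $part in tokenize('2006-12-25T12:15:00', '[\-T:]')
return xs:integer($part)
(: (2006, 12, 25, 12, 15, 0) :)
```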