str

Class str is a collection of static functions for string processing.

UTF-8 strings

int cmp(const char * s1, const char * s2): NULL-pointer safe front-end to strcmp(), NULL lower than non-NULL (even if addressing an empty string)
int icmp(const char * s1, const char * s2): case independent comparison of UTF-8 encoded strings
int n_icmp(const char * s1, const char * s2, dword num): case independent comparison of UTF-8 encoded strings, stops after 'num' bytes if no NUL byte is seen before
bool imatch_pattern(const char * pattern, dword len_pattern, const char * string, dword len_string): simple case independent pattern match for UTF-8 encoded strings - only '*' as meta character, escaped by \2A
always longest match, double asterisk (e.g. "pre*mid*post"), always anchored
int coll(const char * s1, const char * s2): case dependent comparison of UTF-8 encoded strings according to collating sequence
int icoll(const char * s1, const char * s2): case independent comparison of UTF-8 encoded strings according to collating sequence
const char * next(const char * s, const char * s_end): scan for start of UTF-8 character past 's', return start > 's' and < 's_end', 's_end' if not found
const char * prev(const char * s, const char * s_1st): scan backward for start of UTF-8 character before 's', return start < 's' and >= 's_1st', 's_1st' is also returned if not found
dword ucs4_char(const char * s, const char * s_end, const char ** next): converts the UTF-8 character at 's' to UCS-4, the position of following byte is stored in 'next' if next is non-NULL
dword to_utf8(const char * in, char * out, dword len): copy UTF-8 characters from 'in' to 'out', append NUL, return number of bytes written before.
copying ends when NUL or a non-UTF8 character is seen.
if 'len' does not permit copying a complete UTF-8 character and NUL,
copying ends with the last complete UTF-8 character
dword to_strx(const char * in, char * out, dword len): copy NUL terminated UTF-8 string from 'in' to 'out' assuming 'in' contains valid UTF-8 characters only
if 'len' does not permit copying a complete UTF-8 character and NUL,
copying ends with the last complete UTF-8 character
dword utf8_cnt(const char * in): return number of UTF-8 characters in 'in'
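
The truncation rule described for to_utf8() and to_strx() (stop at the last complete UTF-8 character that still leaves room for the NUL) can be illustrated with a minimal sketch. This is not the str implementation: the names sketch_utf8_seq_len and sketch_copy_utf8 are hypothetical, 'len' is assumed to be the capacity of 'out' in bytes including the NUL, and only the lead/continuation byte structure is checked (overlong encodings are not rejected).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: byte length of the UTF-8 sequence started by lead byte 'c',
// or 0 if 'c' is a continuation byte or an invalid lead byte.
static std::size_t sketch_utf8_seq_len(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

// Copy whole UTF-8 characters from 'in' to 'out' and append NUL.
// Assumption: 'len' is the capacity of 'out' in bytes, including the NUL.
// Returns the number of bytes written before the NUL.
static std::uint32_t sketch_copy_utf8(const char *in, char *out, std::uint32_t len) {
    if (!in || !out || len == 0) return 0;
    std::uint32_t written = 0;
    while (*in) {
        std::size_t n = sketch_utf8_seq_len(static_cast<unsigned char>(*in));
        if (n == 0) break;                       // not a valid lead byte
        bool complete = true;
        for (std::size_t i = 1; i < n; ++i) {    // a NUL fails this check, so no overread
            if ((static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) { complete = false; break; }
        }
        if (!complete) break;
        if (written + n + 1 > len) break;        // character plus NUL would not fit
        for (std::size_t i = 0; i < n; ++i) out[written++] = *in++;
    }
    out[written] = 0;
    return written;
}

// Usage: the 2-byte character is never split, even if one more byte would fit.
static void sketch_copy_utf8_example() {
    const char *text = "a\xC3\xA4" "b";          // 'a', U+00E4 (2 bytes), 'b'
    char small[3];
    assert(sketch_copy_utf8(text, small, sizeof small) == 1);
    char exact[4];
    assert(sketch_copy_utf8(text, exact, sizeof exact) == 3);
}
```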

UTF-8 / Latin-1 conversions

bool may_be_utf8(const char * in): check 'in' for valid UTF-8 encoding (plain ASCII is valid too)
bool n_may_be_utf8(const char * in, dword len): check 'len' bytes of 'in' for valid UTF-8 encoding (plain ASCII is valid too)
bool must_be_utf8(const char * in): check 'in' for valid UTF-8 encoding and at least one non-ASCII character
bool n_must_be_utf8(const char * in, dword len): check 'len' bytes of 'in' for valid UTF-8 encoding and at least one non-ASCII character
dword to_latin1(char * inout): convert UTF-8 to Latin-1 in place, append NUL, return number of bytes written before
dword to_latin1(const char * in, char * out, dword len): convert UTF-8 to Latin-1, append NUL, return number of bytes written before
dword to_latin1_transcribe(const char * in, char * out, dword len): convert UTF-8 to Latin-1, transcribe non Latin-1 codes if possible, append NUL, return number of bytes written before
dword to_latin1_xml(const char * in, char * out, dword len): convert UTF-8 to Latin-1 XML attribute, append NUL, return number of bytes written before (obsolete)
dword n_to_latin1(const char * in, dword num, char * out, dword len): convert 'num' bytes from UTF-8 to Latin-1, copy also NUL bytes, append NUL, return number of bytes written before
dword n_to_latin1_n(const char * in, dword cnt, char * out, dword len): convert 'cnt' bytes from UTF-8 to Latin-1, copy also NUL bytes, don't append NUL, return number of bytes written
dword from_latin1(const char * in, char * out, dword len): convert Latin-1 to UTF-8, append NUL, return number of bytes written before
dword from_latin1_n(const char * in, dword num, char * out, dword len): convert 'num' bytes from Latin-1 to UTF-8, append NUL, return number of bytes written before
dword from_latin1_n_len(const char * in, dword num): calculate number of bytes required for UTF-8 encoding 'num' bytes of 'in'
dword n_from_latin1_n(const char * in, dword num, char * out, dword len): convert 'num' bytes from Latin-1 to UTF-8, copy also NUL bytes, don't append NUL, return number of bytes written
unsigned transcribe_to_basic_latin(char * buf, unsigned size_of_buf): convert a UTF-8 string to Latin-1, transcribe characters not available in Latin-1 as far as possible.
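
As an illustration of the Latin-1 to UTF-8 direction (from_latin1() and from_latin1_n_len()), here is a minimal sketch, not the str implementation. The names sketch_latin1_utf8_len and sketch_from_latin1 are hypothetical, and 'len' is assumed to be the capacity of 'out' in bytes including the NUL. Every Latin-1 byte below 0x80 maps to one UTF-8 byte, every other byte to two.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes needed to encode 'num' Latin-1 bytes as UTF-8 (NUL not counted).
static std::uint32_t sketch_latin1_utf8_len(const char *in, std::uint32_t num) {
    std::uint32_t n = 0;
    for (std::uint32_t i = 0; i < num; ++i)
        n += (static_cast<unsigned char>(in[i]) < 0x80) ? 1 : 2;
    return n;
}

// Convert a NUL terminated Latin-1 string to UTF-8 and append NUL.
// Returns the number of bytes written before the NUL.
static std::uint32_t sketch_from_latin1(const char *in, char *out, std::uint32_t len) {
    if (!in || !out || len == 0) return 0;
    std::uint32_t w = 0;
    for (; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        std::uint32_t need = (c < 0x80) ? 1 : 2;
        if (w + need + 1 > len) break;           // keep room for the NUL
        if (c < 0x80) {
            out[w++] = static_cast<char>(c);
        } else {
            out[w++] = static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out[w++] = static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    out[w] = 0;
    return w;
}

// Usage: Latin-1 0xE4 becomes the two UTF-8 bytes 0xC3 0xA4.
static void sketch_from_latin1_example() {
    const char latin1[] = "a\xE4";
    assert(sketch_latin1_utf8_len(latin1, 2) == 3);
    char out[8];
    assert(sketch_from_latin1(latin1, out, sizeof out) == 3);
    assert(static_cast<unsigned char>(out[1]) == 0xC3);
    assert(static_cast<unsigned char>(out[2]) == 0xA4);
}
```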

Latin-1 / UCS-2 conversions

dword ucs2_n_to_latin1(const word * in, dword num, char * out, dword len): convert 'num' UCS-2 words to Latin-1, stop at NUL word, append NUL byte, return number of bytes written
dword latin1_to_ucs2_n(const char * in, word * out, dword num): convert 'num' bytes from Latin-1 to UCS-2, stop at NUL byte, don't append NUL word, return number of words written

UCS-2 strings

int ucs2_cmp(const word * w1, const word * w2): binary comparison of NUL word terminated UCS-2 strings
int ucs2_cmp(const word * w1, dword w1_cnt, const word * w2, dword w2_cnt): binary comparison of counted UCS-2 strings, accept embedded NUL words
int ucs2_icmp(const word * w1, const word * w2): case independent comparison of NUL word terminated UCS-2 strings
int ucs2_icmp(const word * w1, dword w1_cnt, const word * w2, dword w2_cnt): case independent comparison of counted UCS-2 strings, accept embedded NUL words
int ucs2_coll(const word * w1, dword w1_cnt, const word * w2, dword w2_cnt): case dependent comparison of counted UCS-2 encoded strings according to collating sequence
int ucs2_icoll(const word * w1, dword w1_cnt, const word * w2, dword w2_cnt): case independent comparison of counted UCS-2 encoded strings according to collating sequence
dword ucs2_cnt(const word * in): return number of non-NUL words in 'in'
dword ucs2_to_ucs2(const word * in, word * out, dword cnt): copy NUL word terminated UCS-2 string from 'in' to 'out', append NUL word, return number of words written before
word ucs2_chr2lwr(word w): convert UCS-2 character to lower case UCS-2
word ucs2_chr2upr(word w): convert UCS-2 character to upper case UCS-2
dword ucs2_to_le_n(const word * in, byte * out, dword cnt): write 'cnt' UCS-2 words from 'in' to 'out' in little endian byte order, return number of bytes written (not words!)
dword ucs2_to_net_n(const word * in, byte * out, dword cnt): write 'cnt' UCS-2 words from 'in' to 'out' in network byte order, return number of bytes written (not words!)
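
The difference between ucs2_to_le_n() and ucs2_to_net_n() is only the byte order of each word. A minimal sketch under the assumption that 'word' is a 16 bit unsigned integer and 'byte' an 8 bit unsigned integer; the names sketch_ucs2_to_le and sketch_ucs2_to_net are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Write 'cnt' UCS-2 words in little endian order; returns bytes written.
static std::uint32_t sketch_ucs2_to_le(const std::uint16_t *in, std::uint8_t *out, std::uint32_t cnt) {
    for (std::uint32_t i = 0; i < cnt; ++i) {
        out[2 * i]     = static_cast<std::uint8_t>(in[i] & 0xFF);  // low byte first
        out[2 * i + 1] = static_cast<std::uint8_t>(in[i] >> 8);
    }
    return 2 * cnt;
}

// Write 'cnt' UCS-2 words in network (big endian) order; returns bytes written.
static std::uint32_t sketch_ucs2_to_net(const std::uint16_t *in, std::uint8_t *out, std::uint32_t cnt) {
    for (std::uint32_t i = 0; i < cnt; ++i) {
        out[2 * i]     = static_cast<std::uint8_t>(in[i] >> 8);    // high byte first
        out[2 * i + 1] = static_cast<std::uint8_t>(in[i] & 0xFF);
    }
    return 2 * cnt;
}

// Usage: U+20AC serialized in both byte orders.
static void sketch_ucs2_order_example() {
    const std::uint16_t euro[] = { 0x20AC };
    std::uint8_t le[2], net[2];
    assert(sketch_ucs2_to_le(euro, le, 1) == 2);
    assert(le[0] == 0xAC && le[1] == 0x20);
    assert(sketch_ucs2_to_net(euro, net, 1) == 2);
    assert(net[0] == 0x20 && net[1] == 0xAC);
}
```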

UTF-8 / UCS-2 conversions

dword to_ucs2(const char * in, word * out, dword cnt): convert UTF-8 string to UCS-2, append NUL word, return number of words written before
dword to_ucs2_n(const char * in, word * out, dword cnt): convert UTF-8 string to UCS-2, don't append NUL word, return number of words written
dword n_to_ucs2(const char * in, dword num, word * out, dword cnt): convert 'num' bytes from UTF-8 to UCS-2, don't append NUL word, return number of words written
dword from_ucs2(const word * in, char * out, dword len): convert NUL terminated UCS-2 to UTF-8, append NUL, return number of bytes written before
dword from_ucs2_n(const word * in, dword num, char * out, dword len): convert 'num' UCS-2 words (including NUL words) to UTF-8, append NUL, return number of bytes written before
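
A minimal sketch of the UTF-8 to UCS-2 direction, in the spirit of to_ucs2(). It is not the str implementation: the name sketch_to_ucs2 is hypothetical, 'cnt' is assumed to be the capacity of 'out' in words including the NUL word, and conversion simply stops at malformed input or at 4-byte sequences, which do not fit into a single UCS-2 word.

```cpp
#include <cassert>
#include <cstdint>

// Convert a NUL terminated UTF-8 string to UCS-2 and append a NUL word.
// Returns the number of words written before the NUL word.
static std::uint32_t sketch_to_ucs2(const char *in, std::uint16_t *out, std::uint32_t cnt) {
    if (!in || !out || cnt == 0) return 0;
    const unsigned char *s = reinterpret_cast<const unsigned char *>(in);
    std::uint32_t w = 0;
    while (*s && w + 1 < cnt) {                  // keep room for the NUL word
        std::uint16_t ch;
        if (s[0] < 0x80) {
            ch = s[0];
            s += 1;
        } else if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
            ch = static_cast<std::uint16_t>(((s[0] & 0x1F) << 6) | (s[1] & 0x3F));
            s += 2;
        } else if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
            ch = static_cast<std::uint16_t>(((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
            s += 3;
        } else {
            break;                               // malformed or outside UCS-2
        }
        out[w++] = ch;
    }
    out[w] = 0;
    return w;
}

// Usage: 'A', U+00E4 and U+20AC become three UCS-2 words.
static void sketch_to_ucs2_example() {
    std::uint16_t out[4];
    assert(sketch_to_ucs2("A\xC3\xA4\xE2\x82\xAC", out, 4) == 3);
    assert(out[0] == 0x0041 && out[1] == 0x00E4 && out[2] == 0x20AC && out[3] == 0);
}
```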

UTF-8 / URL conversions

dword to_url(const char * in, char * out, dword len): convert UTF-8 string to URL-encoded string, append NUL, return number of bytes written before
dword to_url_cfg(const char * in, char * out, dword len): convert UTF-8 string to URL-encoded string suitable as config line argument, append NUL, return number of bytes written before.
encoding includes the config line syntax characters '%', '<', '>', '{', '}', '\r', '\n'.
dword from_url(const char * in, char * out, dword len): convert URL-encoded string to UTF-8 string, append NUL, return number of bytes written before
dword from_url(char * inout): convert URL-encoded string to UTF-8 string in place, append NUL, return number of bytes written before.
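
The URL conversions can be sketched as plain percent-encoding over the UTF-8 bytes. This is an illustration only, not the str implementation: the names sketch_to_url and sketch_from_url are hypothetical, and the set of characters left unencoded (letters, digits, '-', '.', '_', '~', the "unreserved" set of RFC 3986) is an assumption, since the set used by to_url() is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: only RFC 3986 "unreserved" characters are copied unencoded.
static bool sketch_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

static int sketch_url_hexval(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Percent-encode 'in' into 'out' (capacity 'len' bytes including NUL).
// Returns the number of bytes written before the NUL.
static std::uint32_t sketch_to_url(const char *in, char *out, std::uint32_t len) {
    static const char hex[] = "0123456789ABCDEF";
    if (!in || !out || len == 0) return 0;
    std::uint32_t w = 0;
    for (; *in; ++in) {
        unsigned char c = static_cast<unsigned char>(*in);
        std::uint32_t need = sketch_unreserved(c) ? 1 : 3;
        if (w + need + 1 > len) break;           // never emit a partial %XX
        if (need == 1) {
            out[w++] = static_cast<char>(c);
        } else {
            out[w++] = '%';
            out[w++] = hex[c >> 4];
            out[w++] = hex[c & 0x0F];
        }
    }
    out[w] = 0;
    return w;
}

// Decode %XX sequences in place; the result is never longer than the input.
static std::uint32_t sketch_from_url(char *inout) {
    char *r = inout, *w = inout;
    while (*r) {
        int hi = -1, lo = -1;
        if (r[0] == '%' && (hi = sketch_url_hexval(r[1])) >= 0 && (lo = sketch_url_hexval(r[2])) >= 0) {
            *w++ = static_cast<char>((hi << 4) | lo);
            r += 3;
        } else {
            *w++ = *r++;                         // copy anything else unchanged
        }
    }
    *w = 0;
    return static_cast<std::uint32_t>(w - inout);
}

// Usage: encode, then decode in place.
static void sketch_url_example() {
    char buf[32];
    assert(sketch_to_url("a b/\xC3\xA4", buf, sizeof buf) == 14);  // a%20b%2F%C3%A4
    assert(buf[1] == '%' && buf[2] == '2' && buf[3] == '0');
    assert(sketch_from_url(buf) == 6);
    assert(buf[1] == ' ' && buf[3] == '/');
}
```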

UTF-8 / Punycode conversions

dword to_punycode(const char * in, char * out, dword len): convert UTF-8 string to Punycode
dword from_punycode(const char * in, char * out, dword len): not implemented yet

Config Line Options

'case' in function names means ASCII code case independent

char * args_find(int argc, char * argv[], const char * arg)

find string 'arg' in 'argv[]', if found return next 'argv[]' entry if existent and not starting with '/', otherwise an empty string.
usually 'arg' should contain a leading '/', it's not implied.
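
The lookup rule of args_find() can be sketched as follows. This is an illustration, not the str implementation: the name sketch_args_find is hypothetical, the comparison is assumed to be case sensitive, and returning a null pointer when 'arg' is not found at all is an assumption (the description above only covers the case where 'arg' is found).

```cpp
#include <cassert>
#include <cstring>

static char sketch_empty[] = "";

// Find 'arg' in 'argv[]'. If found, return the following entry unless it is
// missing or starts with '/', in which case an empty string is returned.
// Assumption: a null pointer is returned if 'arg' is not present.
static char *sketch_args_find(int argc, char *argv[], const char *arg) {
    for (int i = 0; i < argc; ++i) {
        if (std::strcmp(argv[i], arg) == 0) {
            if (i + 1 < argc && argv[i + 1][0] != '/') return argv[i + 1];
            return sketch_empty;
        }
    }
    return nullptr;
}

// Usage: "/name box /debug"
static void sketch_args_find_example() {
    char a0[] = "prog", a1[] = "/name", a2[] = "box", a3[] = "/debug";
    char *argv[] = { a0, a1, a2, a3 };
    assert(sketch_args_find(4, argv, "/name") == a2);       // value follows the option
    assert(sketch_args_find(4, argv, "/debug")[0] == 0);    // option without value
    assert(sketch_args_find(4, argv, "/missing") == nullptr);
}
```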

char * parse_args(int argc, char * argv[], int & i, const char * args_tbl[], int & index, char * & value, int * arg=0, byte decode=0)

find option value for an option name listed in 'args_tbl[]'.
an option name "opt-name" in args_tbl[] matches both "/opt-name" and "/opt-name." in 'argv[]'.

int argc: number of entries in argument vector
char * argv[]: argument vector
int & i: starting index in argument vector, index of next option name in argument vector after return
char * args_tbl[]: NULL terminated list of option names, a leading '/' is implied and must not be specified
int & index: returns the index of the matching option in 'args_tbl[]' or -1 if there was no match
char * & value: points to begin of (possibly decoded) option value
int * arg: returns index of matching option name in 'argv[]'
byte decode: decode flags: PARSE_ARGS_DECODE_URL and/or PARSE_ARGS_STRIP_EXCESSIVE_SPACES

bool match(const char * ref, const char * s, char ** ptr)

if 'ptr' is zero match() is identical to (0 == strcmp(ref,s))
if 'ptr' is nonzero match() returns true on a match of 'ref' and the head of 's' with 'ptr' pointing to next character in 's'
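
The two modes of match() (full comparison versus prefix match that reports where the match ended) can be sketched as follows. The name sketch_match is hypothetical, and this is an illustration of the description above, not the str implementation.

```cpp
#include <cassert>
#include <cstring>

// With ptr == nullptr: true only if 'ref' and 's' are identical.
// With ptr != nullptr: true if 's' starts with 'ref'; *ptr then addresses the
// first character of 's' after the matched head.
static bool sketch_match(const char *ref, const char *s, char **ptr) {
    if (!ptr) return std::strcmp(ref, s) == 0;
    std::size_t n = std::strlen(ref);
    if (std::strncmp(ref, s, n) != 0) return false;
    *ptr = const_cast<char *>(s) + n;
    return true;
}

// Usage: parse "mode=fast" by matching the key and continuing at the value.
static void sketch_match_example() {
    const char *line = "mode=fast";
    char *rest = nullptr;
    assert(sketch_match("mode=", line, &rest));
    assert(rest == line + 5);                    // rest addresses "fast"
    assert(!sketch_match("mode=", line, nullptr));
    assert(sketch_match("mode=fast", line, nullptr));
}
```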

bool casematch(const char * ref, const char * s, char ** ptr)

if 'ptr' is zero casematch() is identical to (0 == strcasecmp(ref,s))
if 'ptr' is nonzero casematch() returns true on a case insensitive match of 'ref' and the head of 's' with 'ptr' pointing to next character in 's'

int casecmp(const char * s1, const char * s2)

identical to strcasecmp() (may be missing in standard library)

int n_casecmp(const char * s1, const char * s2, dword num)

identical to strncasecmp() (may be missing in standard library)

void caselwr(char * inout)

convert characters in 'inout' to lower case

void n_caselwr(char * inout, dword num)

convert 'num' characters in 'inout' to lower case

void caseupr(char * inout)

convert characters in 'inout' to upper case

void n_caseupr(char * inout, dword num)

convert 'num' characters in 'inout' to upper case

char chr2lwr(char c)

return 'c' converted to lower case

char chr2upr(char c)

return 'c' converted to upper case

Latin-1 strings

int latin1_icmp(const char * s1, const char * s2): case independent comparison of Latin-1 encoded strings
int latin1_n_icmp(const char * s1, const char * s2, dword num): case independent comparison of Latin-1 encoded strings, stops after 'num' characters if no NUL character is seen before
bool latin1_imatch_pattern(const char * pattern, dword len_pattern, const char * string, dword len_string): simple case independent pattern match for Latin-1 encoded strings - only '*' as meta character, escaped by \2A
always longest match, double asterisk (e.g. "pre*mid*post"), always anchored
char latin1_chr2lwr(char c): return 'c' converted to lower case
char latin1_chr2upr(char c): return 'c' converted to upper case

Plain strings

int diff(const char * s1, const char * s2): return -1 if the strings 's1' and 's2' are identical, otherwise return the offset of the first difference
dword to_hexmem(const char * s, byte * mem, dword len): convert the hexadecimal characters (any case) in 's' to their binary representation.
stop at NUL or after 'len' bytes are written or when a non-hexadecimal character is seen.
the low order part of last byte is set to zero if not given in 's'.
dword to_hexmem(const char * s, char ** ptr, byte * mem, dword len, bool fill=true): convert the hexadecimal characters (any case) in 's' to their binary representation.
stop at NUL or after 'len' bytes are written or when a non-hexadecimal character is seen.
the low order part of last byte is set to zero if not given in 's'.
if 'fill' is true the remaining bytes of 'mem' are zero filled.
if 'ptr' is nonzero it points to first not converted character in 's' after return.
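
The stop conditions of to_hexmem() can be sketched as follows. This is an illustration, not the str implementation: the name sketch_to_hexmem is hypothetical, 'ptr' is declared as const char ** here to avoid a cast, and returning the number of bytes written is an assumption, since the return value is not described above.

```cpp
#include <cassert>
#include <cstdint>

static int sketch_hexval(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;                                   // also covers the terminating NUL
}

// Convert hexadecimal characters in 's' to bytes in 'mem' (capacity 'len').
// Assumption: the return value is the number of bytes written.
static std::uint32_t sketch_to_hexmem(const char *s, const char **ptr, std::uint8_t *mem,
                                      std::uint32_t len, bool fill = true) {
    std::uint32_t n = 0;
    while (n < len) {
        int hi = sketch_hexval(*s);
        if (hi < 0) break;                       // NUL or non-hexadecimal character
        ++s;
        int lo = sketch_hexval(*s);
        if (lo < 0) {                            // odd number of digits:
            mem[n++] = static_cast<std::uint8_t>(hi << 4);  // low order part is zero
            break;
        }
        ++s;
        mem[n++] = static_cast<std::uint8_t>((hi << 4) | lo);
    }
    if (fill) for (std::uint32_t i = n; i < len; ++i) mem[i] = 0;
    if (ptr) *ptr = s;                           // first character not converted
    return n;
}

// Usage: five digits followed by a non-hexadecimal character.
static void sketch_to_hexmem_example() {
    std::uint8_t mem[4] = { 0xFF, 0xFF, 0xFF, 0xFF };
    const char *rest = nullptr;
    assert(sketch_to_hexmem("1a2B3x", &rest, mem, sizeof mem) == 3);
    assert(mem[0] == 0x1A && mem[1] == 0x2B && mem[2] == 0x30 && mem[3] == 0x00);
    assert(*rest == 'x');
}
```
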
char * from_hexmem(const byte * mem, const dword len, char * s): convert 'len' bytes of 'mem' to hexadecimal characters in 's', append NUL, return 's'
byte chr2hexval(char c): return binary value assigned to hexadecimal character 'c' (any case), return 0xff if 'c' is not a hexadecimal character
dword to_str(const char * in, char * out, dword len): copy characters from 'in' to 'out', append NUL, return number of characters written before.
copying ends when NUL is seen or 'len' -1 characters are written.
Note: For UTF-8 strings use str::to_strx instead.
dword to_xml(const char * in, char * out, dword len): copy 'in' to 'out', replace XML syntax characters by entity reference, append NUL, return number of bytes written before
ulong64 to_id(const char * str): copy the characters from 'str' to a 64 bit integer, stop at NUL or after 8 characters, fill up with 0 bytes if necessary
bool to_tm(const char * s, char ** ptr, struct tm & tm): converts a string of the format dd.mm.yy-hh.mm.ss to a struct tm, return true if the format was accepted.
if 'ptr' is nonzero it points to first not converted character in 's' after return.
unsigned to_time_iso8601(time_t time_gmt, char * out, dword out_len): converts a GMT time_t to ISO-8601, e.g. "2005-02-15T11:26:44Z".
out buffer is zero terminated.
The length of out is returned.
unsigned to_time_rfc1123(time_t time_gmt, char * out, dword out_len): converts a GMT time_t to RFC 1123 format, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
out buffer is zero terminated.
The length of out is returned.
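
The two output formats can be reproduced with the standard library, which may help when checking results of to_time_iso8601() and to_time_rfc1123(). This is a sketch with hypothetical names (sketch_time_iso8601, sketch_time_rfc1123), not the str implementation; it assumes the default "C" locale for the day and month names and returns 0 with an empty string if 'out' is too small.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>

// Shared helper: format 't' as GMT with the given strftime() pattern.
static unsigned sketch_time_format(std::time_t t, const char *fmt, char *out, std::size_t out_len) {
    if (!out || out_len == 0) return 0;
    std::tm *tm = std::gmtime(&t);               // note: uses a static buffer
    std::size_t n = tm ? std::strftime(out, out_len, fmt, tm) : 0;
    if (n == 0) out[0] = 0;                      // strftime leaves 'out' undefined on failure
    return static_cast<unsigned>(n);
}

static unsigned sketch_time_iso8601(std::time_t t, char *out, std::size_t out_len) {
    return sketch_time_format(t, "%Y-%m-%dT%H:%M:%SZ", out, out_len);
}

static unsigned sketch_time_rfc1123(std::time_t t, char *out, std::size_t out_len) {
    return sketch_time_format(t, "%a, %d %b %Y %H:%M:%S GMT", out, out_len);
}

// Usage: the two example timestamps quoted in this section.
static void sketch_time_example() {
    char buf[40];
    assert(sketch_time_iso8601(1108466804, buf, sizeof buf) == 20);  // 2005-02-15T11:26:44Z
    assert(buf[0] == '2' && buf[10] == 'T' && buf[19] == 'Z');
    assert(sketch_time_rfc1123(784111777, buf, sizeof buf) == 29);   // Sun, 06 Nov 1994 08:49:37 GMT
    assert(buf[0] == 'S' && buf[5] == '0' && buf[6] == '6');
}
```
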
time_t from_time_iso8601(const char * str): converts an ISO-8601 string to time_t, e.g. "2005-02-15T07:54:34Z", "2002-11-26T20:27:11.000Z" or "2005-03-15T07:39:42Z-02:35".
time_t from_time_rfc1123(const char * str): converts an RFC 1123 string to time_t, e.g. "Sun, 06 Nov 1994 08:49:37 GMT".
time_t from_time_rfc1036(const char * str): converts an RFC 1036 string to time_t, e.g. "Sunday, 06-Nov-94 08:49:37 GMT".
time_t from_time_ansi(const char * str): converts an ANSI string to time_t, e.g. "Sun Nov 6 08:49:37 1994".
bool is_dial_string(const char * s): returns true if 's' contains only characters accepted as dialable digits by innovaphone devices
dword from_ie_number(const byte * ie, char * out, dword len): copy the number part of info element 'ie' to out, append NUL, return number of bytes written before
bool is_true(const char * s): returns true if 's' is nonzero and either "true" or "on"
dword n_len(const char * in, dword len): return number of non-NUL characters in 'in' or 'len' if no NUL character is found up to 'in' + 'len' - 1
char * strip_whitespace(char * in): replace trailing whitespace characters in 'in' by NUL, return address of first non-whitespace character in 'in'
void replace(const char * in, char * out, dword len, const char * placeholder, const char * replace): replaces the first occurrence of 'placeholder' in the 'in' buffer by 'replace' and writes the result to the out buffer. The 'out' buffer will always be null-terminated.
dword split(char * in, char * tokens[], dword max_tokens, const char * separator): Splits the string in the 'in' buffer into tokens, using a specified 'separator' string to determine where to make each split. The resulting tokens are stored in the 'tokens' array. 'max_tokens' specifies the size of the 'tokens' array. If 'separator' is null or empty, a single token containing the whole string will be returned. Note that the 'in' buffer is modified by this function call. Returns the number of tokens stored in the 'tokens' array.
dword join(char * out, dword len, char * tokens[], dword num_tokens, const char * separator): Joins the strings in the 'tokens' array into a single string using a specified 'separator' string. The result is written to the 'out' buffer. Returns the length of the string written to the 'out' buffer.
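
In-place splitting as described for split() can be sketched as follows. This is an illustration, not the str implementation: the name sketch_split is hypothetical, and two details are assumptions because they are not specified above: empty tokens between adjacent separators are kept, and once 'max_tokens' is reached the last token keeps the unsplit remainder.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Split 'in' at each occurrence of 'separator' by overwriting the separator
// start with NUL. Returns the number of tokens stored in 'tokens'.
static std::uint32_t sketch_split(char *in, char *tokens[], std::uint32_t max_tokens,
                                  const char *separator) {
    if (!in || !tokens || max_tokens == 0) return 0;
    std::uint32_t n = 0;
    tokens[n++] = in;
    if (!separator || !*separator) return n;     // whole string as single token
    std::size_t sep_len = std::strlen(separator);
    while (n < max_tokens) {
        char *hit = std::strstr(in, separator);
        if (!hit) break;
        *hit = 0;                                // terminate the previous token
        in = hit + sep_len;
        tokens[n++] = in;
    }
    return n;
}

// Usage: the input buffer is modified, tokens point into it.
static void sketch_split_example() {
    char line[] = "a, b, c";
    char *tok[4];
    assert(sketch_split(line, tok, 4, ", ") == 3);
    assert(std::strcmp(tok[0], "a") == 0);
    assert(std::strcmp(tok[1], "b") == 0);
    assert(std::strcmp(tok[2], "c") == 0);
}
```
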
char * escape_quoted(char * in, char * &out, unsigned len): copy 'in' to 'out', insert a backslash before each backslash, single or double quote read from 'in',
append NUL, set 'out' to position after NUL, return initial value of 'out'.
char * escape_quoted_printable(char * in, char * out, unsigned out_size, bool escape_q_string_chars = false): Encodes the input buffer using the Quoted-Printable encoding as defined in RFC 2045, but without inserting additional line breaks.
If escape_q_string_chars is true, additionally '?' and '_' are escaped, as needed in the "Q" encoding defined in RFC 1342.
dword fnv1a_hash(const char *s): return the 32 bit FNV-1a hash over the characters in 's' not including the terminating NUL
dword fnv1a_hash(const byte *s, word length): return the 32 bit FNV-1a hash over 'length' bytes of 's' including NUL bytes
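
For reference, the 32 bit FNV-1a algorithm itself is short: start with the offset basis 2166136261, then for each byte XOR it into the hash and multiply by the prime 16777619. The sketch below uses the hypothetical name sketch_fnv1a and standard integer types; whether fnv1a_hash() matches it bit for bit has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32 bit FNV-1a over 'length' bytes, NUL bytes included.
static std::uint32_t sketch_fnv1a(const std::uint8_t *s, std::size_t length) {
    std::uint32_t h = 2166136261u;               // FNV offset basis
    for (std::size_t i = 0; i < length; ++i) {
        h ^= s[i];                               // XOR first (the "1a" variant)
        h *= 16777619u;                          // then multiply by the FNV prime
    }
    return h;
}

// 32 bit FNV-1a over a NUL terminated string, terminating NUL excluded.
static std::uint32_t sketch_fnv1a(const char *s) {
    std::uint32_t h = 2166136261u;
    for (; *s; ++s) {
        h ^= static_cast<unsigned char>(*s);
        h *= 16777619u;
    }
    return h;
}

// Usage: the empty string hashes to the offset basis.
static void sketch_fnv1a_example() {
    assert(sketch_fnv1a("") == 0x811C9DC5u);
    assert(sketch_fnv1a("a") == 0xE40C292Cu);
}
```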