1 {** Implementación de expresiones regulares para Delphi}
\r
8 Regular Expressions for Delphi
\r
15 anso@mail.ru, anso@usa.net
\r
17 http://anso.virtualave.net
\r
19 This library is derived from Henry Spencer sources.
\r
20 I translated the C sources into Object Pascal,
\r
21 implemented object wrapper and some new features.
\r
22 Many features suggested or partially implemented
\r
23 by TRegExpr's users (see Gratitude below).
\r
26 ---------------------------------------------------------------
\r
28 ---------------------------------------------------------------
\r
29 Copyright (c) 1999-00 by Andrey V. Sorokin <anso@mail.ru>
\r
31 This software is provided as it is, without any kind of warranty
\r
32 given. Use it at your own risk.
\r
34 You may use this software in any kind of development, including
\r
35 comercial, redistribute, and modify it freely, under the
\r
36 following restrictions :
\r
37 1. The origin of this software may not be mispresented, you must
\r
38 not claim that you wrote the original software. If you use
\r
39 this software in any kind of product, it would be appreciated
\r
40 that there in a information box, or in the documentation would
\r
41 be an acknowledgmnent like this
\r
42 Partial Copyright (c) 2000 by Andrey V. Sorokin
\r
43 2. You may not have any income from distributing this source
\r
44 to other developers. When you use this product in a comercial
\r
45 package, the source may not be charged seperatly.
\r
48 ---------------------------------------------------------------
\r
49 Legal issues for the original C sources:
\r
50 ---------------------------------------------------------------
\r
51 * Copyright (c) 1986 by University of Toronto.
\r
52 * Written by Henry Spencer. Not derived from licensed software.
\r
54 * Permission is granted to anyone to use this software for any
\r
55 * purpose on any computer system, and to redistribute it freely,
\r
56 * subject to the following restrictions:
\r
57 * 1. The author is not responsible for the consequences of use of
\r
58 * this software, no matter how awful, even if they arise
\r
59 * from defects in it.
\r
60 * 2. The origin of this software must not be misrepresented, either
\r
61 * by explicit claim or by omission.
\r
62 * 3. Altered versions must be plainly marked as such, and must not
\r
63 * be misrepresented as being the original software.
\r
66 ---------------------------------------------------------------
\r
68 ---------------------------------------------------------------
\r
70 found and fixed ugly bug in big string processing
\r
72 testing in CPPB and suggesting/implementing many features
\r
74 implemented Offset parameter
\r
78 Implemented UniCode support, found and fixed some bugs
\r
80 Implemented some features, many optimization suggestions
\r
82 And many others - for big work of bug hunting !
\r
84 I am still looking for person who can help me to translate
\r
85 this documentation into other languages (especially German)
\r
88 ---------------------------------------------------------------
\r
90 ---------------------------------------------------------------
\r
92 -=- VCL-version of TRegExpr - for dummies ;) and TRegExprEdit
\r
93 (replacement for TMaskEdit).
\r
94 Actually, I am writing non-VCL aplications (with web-based
\r
95 interfaces), so I don't need VCL's TRegExpr for myself.
\r
96 Will it be really usefull ?
\r
98 -=- full functionallity of braces {}
\r
100 -=- working with pascal-style string.
\r
101 Now pascal-strings converted into PChar, so
\r
102 you can't find r.e. in strings with #0 -chars.
\r
103 (suggested by Pavel O).
\r
105 -=- non-greedy style (suggested by Martin Baur)
\r
107 -=- put precalculated lengths into EXACTLY[CI] !
\r
109 -=- fInputString as string (suggested by Ralf Junker)
\r
111 -=- Add regstart optimization for case-insensitive mode ?
\r
112 Or complitely remove because FirstCharSet is faster ?
\r
114 -=- "Russian Ranges" --> National ranges (use property WordChars ?
\r
115 for ordering letters in ranges by its order in WirdsChars if modifier /r is On)
\r
117 -=- FirstCharSet as array [#0 .. #255] of REChar ?
\r
118 (2x faster then set of REChar)
\r
120 -=- p-code optimization (remove BRANCH-to-EEND, COMMENT, BACK(?)
\r
121 merge EXACTLY etc).
\r
123 I need your suggestions !
\r
124 What are more importent in this list ?
\r
125 Did I forget anything ?
\r
128 ---------------------------------------------------------------
\r
130 ---------------------------------------------------------------
\r
134 (^) upgraded implementation
\r
136 v. 0.938 2000.07.23
\r
137 -=- (^) Exeptions now jump to appropriate source line, not
\r
138 to Error procedure (I am not quite sure this is safe for
\r
139 all compiler versions. You can turn it off - remove
\r
140 reRealExceptionAddr definition below).
\r
141 -=- (^) Forgotten BSUBEXP[CI] in FillFirstCharSet caused
\r
142 exeption 'memory corruption' in case if back reference can
\r
143 be first op, like this: (a)*\1 (first subexpression can be
\r
144 skipped and we'll start matching with back reference..).
\r
146 v. 0.937 2000.06.12
\r
147 -=- (-) Bug in optimization engine (since v.0.934). In some cases
\r
148 TRegExpr didn't catch right strings.
\r
149 Thanks to Matthias Fichtner
\r
151 v. 0.936 2000.04.22
\r
152 -=- (+) Back references, like <font size=(['"]?)(\d+)\1>, see
\r
154 -=- (+) Wide hex char support, like '\x{263a}'
\r
156 v. 0.935 2000.04.19 (by Yury Finkel)
\r
157 -=- (-) fInvertCase now isn't readonly ;)
\r
158 -=- (-) UniCode mode compiling errors
\r
160 v. 0.934 2000.04.17
\r
161 -=- (^) New ranges implementation (range matching now is very fast
\r
162 - uses one(!) CPU instruction)
\r
163 -=- (^) Internal p-code structure converted into 32-bits - works
\r
164 faster and now there is no 64K limit for compiled r.e.
\r
165 -=- (^) '{m,n}' now use 32-bits arguments (up to 2147483646) - specially
\r
166 for Dmitry Veprintsev ;)
\r
167 -=- (^) Ranges now support metachars: [\n-\x0D] -> #10,#11,#12,#13;
\r
168 Changed '-' processing, now it's like in Perl:
\r
169 [\d-t] -> '0'..'9','-','t'; []-a] -> ']'..'a'
\r
170 -=- (-) Bug with \t and etc macro (they worked only in ranges)
\r
171 Thanks to Yury Finkel
\r
172 -=- (^) Added new preprocessing optimization (see FirstCharSet).
\r
173 Incredible fast (!). But be carefull it isn's properly tested.
\r
174 You can switch it Off - remove UseFirstCharSet definition.
\r
175 -=- (^) Many other speed optimizations
\r
176 -=- (-) Case-insensitive mode now support system-defined national
\r
177 charset (due to bug in v.0.90 .. 0.926 supported only english one)
\r
178 -=- (^) Case-insensitive mode implemented with InvertCase (param &
\r
179 result of REChar type) - works 10 .. 100 times faster.
\r
180 -=- (^) Match and ExecNext interfaces optimized, added IsProgrammOk
\r
182 -=- (^) Increased NSUBEXP (now 15) and fixed code for this, now you
\r
183 can simply increase NSUBEXP constant by yourself.
\r
184 Suggested by Alexander V. Akimov.
\r
185 -=- (^+) Substitute adapted for NSUBEXP > 10 and significant (!)
\r
186 optimized, improved error checking.
\r
187 ATTENTION! Read new Substitute description - syntax was changed !
\r
188 -=- (+) SpaceChars & WordChars property - now you may change chars
\r
189 treated as \s & \w. By defauled assigned RegExprSpaceChars/WordChars
\r
190 -=- (+) Now \s and \w supported in ranges
\r
191 -=- (-) Infinite loop if end of range=#$FF
\r
192 Thanks to Andrey Kolegov
\r
193 -=- (+) Function QuoteRegExprMetaChars (see description)
\r
194 -=- (+) UniCode support - sorry, works VERY slow (remove '.' from
\r
195 {.$DEFINE UniCode} after this comment for unicode version).
\r
196 Implemented by Yury Finkel
\r
198 v. 0.926 2000.02.26
\r
199 -=- (-) Old bug derived from H.Spencer sources - SPSTART was
\r
200 set for '?' and '*' instead of '*', '{m,n}' and '+'.
\r
201 -=- (-^) Now {m,n} works like Perl's one - error occures only
\r
202 if m > n or n > BracesMax (BracesMax = 255 in this version).
\r
203 In other cases (no m or nondigit symbols in m or n values,
\r
204 or no '}') symbol '{' will be compiled as literal.
\r
205 Note: so, you must include m value (use {0,n} instead of {,n}).
\r
206 Note: {m,} will be compiled as {m,BracesMax}.
\r
207 -=- (-^) CaseInsensitive mode now support ranges
\r
208 '(?i)[a]' == '[aA]'
\r
209 -=- (^) Roman-number template in TestRExp ;)
\r
210 -=- (+^) Beta version of complex-braces - like ((abc){1,2}|d){3}
\r
211 By default its turned off. If you want take part in beta-testing,
\r
212 please, remove '.' from {.$DEFINE ComplexBraces} below this comments.
\r
213 -=- (-^) Removed \b metachar (in Perl it isn't BS as in my implementation,
\r
215 -=- (+) Add /s modifier. Bu I am not sure that it's ok for Windows.
\r
216 I implemented it as [^\n] for '.' metachar in non-/s mode.
\r
217 But lines separated by \n\r in windows. I need you suggestions !
\r
218 -=- (^) Sorry, but I had to rename Modifiers to ModifierStr
\r
219 (ModifierS uses for /s now)
\r
222 -=- (^) some changes in documentation and demo-project.
\r
225 -=- (+) implemented braces repetitions {min,max}.
\r
226 Sorry - only simple cases now - like '\d{2,3}'
\r
227 or '[a-z1-9]{,7}', but not (abc){2,3} ..
\r
228 I still too short in time.
\r
229 Wait for future versions of TRegExpr or
\r
230 implement it by youself and share with me ;)
\r
231 -=- (+) implemented case-insensitive modifier and way
\r
232 to work with other modifiers - see properties
\r
233 Modifiers, Modifier, ModifierI
\r
234 and (?ismx-ismx) Perl extension.
\r
235 You may use global variables RegExpr* for assigning
\r
236 default modifier values.
\r
237 -=- (+) property ExtSyntaxEnabled changed to 'r'-modifier
\r
238 (russian extensions - see documentation)
\r
239 -=- (+) implemented (?#comment) Perl extension - very hard
\r
240 and usefull work ;)
\r
241 -=- (^) property MatchCount renamed to SubExprMatchCount.
\r
242 Sorry for any inconvenients, but it's because new
\r
243 version works slightly different and if you used
\r
244 MatchCount in your programms you have to rethink
\r
245 it ! (see comments to this property)
\r
246 -=- (+) add InputString property - stores input string
\r
247 from last Exec call. You may directly assign values
\r
248 to this property for using in ExecPos method.
\r
249 -=- (+) add ExecPos method - for working with assigned
\r
250 to InputString property. You may use it like this
\r
251 InputString := AString;
\r
254 InputString := AString;
\r
256 Note: ExecPos without parameter works only in
\r
257 Delphi 4 or higher.
\r
258 -=- (+) add ExecNext method - simple and fast (!) way to finding
\r
259 multiple occurences of r.e. in big input string.
\r
260 -=- (^) Offset parameter removed from Exec method, if you
\r
261 used it in your programs, please replace all
\r
262 Exec (AString, AOffset)
\r
264 InputString := AString; ExecPos (AOffset)
\r
265 Sorry for any inconvenients, but old design
\r
266 (see v.0.81) was too ugly :(
\r
267 In addition, multiple Exec calls with same input
\r
268 string produce fool overhead because each Exec
\r
269 reallocate input string buffer.
\r
270 -=- (^) optimized implementation of Substitution,
\r
271 Replace and Split methods
\r
272 -=- (-) fixed minor bug - if r.e. compilation raise error
\r
273 during second pass (!!! I think it's impossible
\r
274 in really practice), TRegExpr stayed in 'compiled'
\r
276 -=- (-) fixed bug - Dump method didn't check program existance
\r
277 and raised 'access violation' if previouse Exec
\r
278 was finished with error.
\r
279 -=- (+) changed error handling (see functions Error, ErrorMsg,
\r
280 LastError, property CompilerErrorPos, type ERegExpr).
\r
281 -=- (-^) TRegExpr.Replace, Split and ExecNext made a infinite
\r
282 loop in case of r.e. match empty-string.
\r
283 Now ExecNext moves by MatchLen if MatchLen <> 0
\r
284 and by +1 if MatchLen = 0
\r
285 Thanks to Jon Smith and George Tasker for bugreports.
\r
286 -=- (-) While playing with null-matchs I discovered, that
\r
287 null-match at tail of input string is never found.
\r
288 Well, I fixed this, but I am not sure this is safe
\r
289 (MatchPos[0]=length(AInputString)+1, MatchLen = 0).
\r
290 Any suggetions are very appreciated.
\r
291 -=- (^) Demo project and documentation was upgraded
\r
292 -=- (^) Documentation and this version was published on my home page
\r
296 v. 0.81 1999.12.25 // Merry Christmas ! :)
\r
297 -=- added \s (AnySpace) and \S (NotSpace) meta-symbols
\r
298 - implemented by Stephan Klimek with minor fixes by AVS
\r
299 -=- added \f, \a and \b chars (translates into FF, BEL, BS)
\r
300 -=- removed meta-symbols 'ö' & 'Ö' - sorry for any inconvenients
\r
301 -=- added Match property (== copy (InputStr, MatchPos [Idx], MatchLen [Idx]))
\r
302 -=- added extra parameter Offset to Exec method
\r
303 (thanks to Steve Mudford)
\r
306 -=- fixed bug - in some cases the r.e. [^...]
\r
307 incorrectly processed (as any symbol)
\r
308 (thanks to Jan Korycan)
\r
309 -=- Some changes and improvements in TestRExp.dpr
\r
311 v. 0.6 1999.08.13 (Friday 13 !)
\r
312 -=- changed header of TRegExpr.Substitute
\r
313 -=- added Split, Replace & appropriate
\r
314 global wrappers (thanks to Stephan Klimek for suggetions)
\r
317 -=- TRegExpr.Substitute routine added
\r
318 -=- Some changes and improvements in TestRExp.dpr
\r
319 -=- Fixed bug in english version of documentation
\r
320 (Thanks to Jon Buckheit)
\r
323 -=- Fixed bug with parsing of strings longer then 255 bytes
\r
324 (thanks to Guido Muehlwitz)
\r
325 -=- Fixed bug in RegMatch - mathes only first occurence of r.e.
\r
326 (thanks to Stephan Klimek)
\r
329 -=- ExecRegExpr function
\r
332 -=- packed into object-pascal class
\r
333 -=- code slightly rewriten for pascal
\r
334 -=- now macro correct proceeded in ranges
\r
335 -=- r.e.ranges syntax extended for russian letters ranges:
\r
336 à-ÿ - replaced with all small russian letters (Win1251)
\r
337 À-ß - replaced with all capital russian letters (Win1251)
\r
338 à-ß - replaced with all russian letters (Win1251)
\r
339 -=- added macro '\d' (opcode ANYDIGIT) - match any digit
\r
340 -=- added macro '\D' (opcode NOTDIGIT) - match not digit
\r
341 -=- added macro '\w' (opcode ANYLETTER) - match any english letter or '_'
\r
342 -=- added macro '\W' (opcode NOTLETTER) - match not english letter or '_'
\r
343 (all r.e.syntax extensions may be turned off by flag ExtSyntax)
\r
346 first version, with bugs, without help => must die :(
\r
350 {$DEFINE DebugRegExpr} // define for dump/trace enabling
\r
352 {$DEFINE reRealExceptionAddr} // if defined then exceptions will
\r
353 // jump to appropriate source line, not to Error procedure
\r
355 {.$DEFINE ComplexBraces} // define for beta-version of braces
\r
356 // (in stable version it works only for simple cases)
\r
358 {.$DEFINE UniCode} // define for Unicode support
\r
360 {$IFNDEF UniCode} // optionts applicable only for non-UniCode
\r
361 {$DEFINE UseSetOfChar} // Significant optimization by using set of char
\r
364 {$IFDEF UseSetOfChar}
\r
365 {$DEFINE UseFirstCharSet} // Significant optimization inm some cases
\r
370 // Determine version (for using 'params by default')
\r
371 {$IFNDEF VER80} { Delphi 1.0}
\r
372 {$IFNDEF VER90} { Delphi 2.0}
\r
373 {$IFNDEF VER93} { C++Builder 1.0}
\r
374 {$IFNDEF VER100} { Borland Delphi 3.0}
\r
375 {$DEFINE D4_} { Delphi 4.0 or higher}
\r
380 {.$IFNDEF VER110} { Borland C++Builder 3.0}
\r
381 {.$IFNDEF VER120} {Borland Delphi 4.0}
\r
385 Classes, // TStrings in Split method
\r
386 SysUtils; // Exception
\r
391 PRegExprChar = PWideChar;
\r
392 RegExprString = WideString;
\r
395 PRegExprChar = PChar;
\r
396 RegExprString = string;
\r
399 TREOp = REChar; // internal p-code type //###0.933
\r
401 TRENextOff = integer; // internal Next "pointer" (offset to current p-code) //###0.933
\r
402 PRENextOff = ^TRENextOff; // used for extracting Next "pointers" from compiled r.e. //###0.933
\r
403 TREBracesArg = integer; // type of {m,n} arguments
\r
404 PREBracesArg = ^TREBracesArg;
\r
407 REOpSz = SizeOf (TREOp) div SizeOf (REChar); // size of p-code in RegExprString units
\r
408 RENextOffSz = SizeOf (TRENextOff) div SizeOf (REChar); // size of Next 'pointer' -"-
\r
409 REBracesArgSz = SizeOf (TREBracesArg) div SizeOf (REChar); // size of BRACES arguments -"-
\r
412 TRegExprInvertCaseFunction = function (const Ch : REChar) : REChar
\r
416 RegExprModifierI : boolean = False;
\r
417 // default value for ModifierI
\r
419 RegExprModifierR : boolean = True;
\r
420 // default value for ModifierR
\r
422 RegExprModifierS : boolean = True;
\r
423 // default value for ModifierS
\r
425 RegExprSpaceChars : RegExprString = // chars for /s & /S
\r
426 ' '#$9#$A#$D#$C; // default for SpaceChars property
\r
428 RegExprWordChars : RegExprString = // chars for /w & /W
\r
429 // Ampliado por Leandro Lucarella (11/00)
\r
430 'abcdefghijklmnopqrstuvwxyzáéíóúýäëïöüÿàèìòùâêîôûñç'
\r
431 + 'ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÝÄËÏÖÜ
\9fÀÈÌÒÙÂÊÎÔÛÑÇ_'; // default for WordChars property
\r
435 NSUBEXP = 15; // max number of subexpression //###0.929
\r
436 // Be carefull - don't use values which overflow CLOSE opcode
\r
437 // (in this case you'll get compiler erorr).
\r
438 // Big NSUBEXP will cause more slow work and more stack required
\r
439 MaxBracesArg = $7FFFFFFF - 1; // max value for {n,m} arguments //###0.933
\r
440 {$IFDEF ComplexBraces}
\r
441 LoopStackMax = 10; // max depth of loops stack //###0.925
\r
444 // if range includes more then TinySetLen chars, //###0.934
\r
445 // then use full (32 bytes) ANYOFFULL instead of ANYOF[BUT]TINYSET
\r
446 // !!! Attension ! If you change TinySetLen, you must
\r
447 // change code marked as "//!!!TinySet"
\r
452 {$IFDEF UseSetOfChar}
\r
453 PSetOfREChar = ^TSetOfREChar;
\r
454 TSetOfREChar = set of REChar;
\r
457 {** Clase auxiliar que implementa expresiones Regulares. Se ultiliza para validar los campos de texto}
\r
460 startp : array [0 .. NSUBEXP - 1] of PRegExprChar; // founded expr starting points
\r
461 endp : array [0 .. NSUBEXP - 1] of PRegExprChar; // founded expr end points
\r
463 {$IFDEF ComplexBraces}
\r
464 LoopStack : array [1 .. LoopStackMax] of integer; // state before entering loop
\r
465 LoopStackIdx : integer; // 0 - out of all loops
\r
468 // The "internal use only" fields to pass info from compile
\r
469 // to execute that permits the execute phase to run lots faster on
\r
471 regstart : REChar; // char that must begin a match; '\0' if none obvious
\r
472 reganch : REChar; // is the match anchored (at beginning-of-line only)?
\r
473 regmust : PRegExprChar; // string (pointer into program) that match must include, or nil
\r
474 regmlen : integer; // length of regmust string
\r
475 // Regstart and reganch permit very fast decisions on suitable starting points
\r
476 // for a match, cutting down the work a lot. Regmust permits fast rejection
\r
477 // of lines that cannot possibly match. The regmust tests are costly enough
\r
478 // that regcomp() supplies a regmust only if the r.e. contains something
\r
479 // potentially expensive (at present, the only such thing detected is * or +
\r
480 // at the start of the r.e., which can involve a lot of backup). Regmlen is
\r
481 // supplied because the test in regexec() needs it and regcomp() is computing
\r
483 {$IFDEF UseFirstCharSet} //###0.929
\r
484 FirstCharSet : TSetOfREChar;
\r
487 // work variables for Exec's routins - save stack in recursion}
\r
488 reginput : PRegExprChar; // String-input pointer.
\r
489 fInputStart : PRegExprChar; // Pointer to first char of input string.
\r
490 fInputEnd : PRegExprChar; // Pointer to char AFTER last char of input string
\r
492 // work variables for compiler's routines
\r
493 regparse : PRegExprChar; // Input-scan pointer.
\r
494 regnpar : integer; // count.
\r
496 regcode : PRegExprChar; // Code-emit pointer; @regdummy = don't.
\r
497 regsize : integer; // Code size.
\r
499 regexpbeg : PRegExprChar; // only for error handling. Contains
\r
500 // pointer to beginning of r.e. while compiling
\r
501 fExprIsCompiled : boolean; // true if r.e. successfully compiled
\r
503 // programm is essentially a linear encoding
\r
504 // of a nondeterministic finite-state machine (aka syntax charts or
\r
505 // "railroad normal form" in parsing technology). Each node is an opcode
\r
506 // plus a "next" pointer, possibly plus an operand. "Next" pointers of
\r
507 // all nodes except BRANCH implement concatenation; a "next" pointer with
\r
508 // a BRANCH on both ends of it is connecting two alternatives. (Here we
\r
509 // have one of the subtle syntax dependencies: an individual BRANCH (as
\r
510 // opposed to a collection of them) is never concatenated with anything
\r
511 // because of operator precedence.) The operand of some types of node is
\r
512 // a literal string; for others, it is a node leading into a sub-FSM. In
\r
513 // particular, the operand of a BRANCH node is the first node of the branch.
\r
514 // (NB this is *not* a tree structure: the tail of the branch connects
\r
515 // to the thing following the set of BRANCHes.) The opcodes are:
\r
516 programm : PRegExprChar; // Unwarranted chumminess with compiler.
\r
518 fExpression : PRegExprChar; // source of compiled r.e.
\r
519 fInputString : PRegExprChar; // input string
\r
521 fLastError : integer; // see Error, LastError
\r
523 fModifiers : integer; // modifiers
\r
524 fCompModifiers : integer; // compiler's copy of modifiers
\r
525 fProgModifiers : integer; // values modifiers from last programm compilation
\r
527 fSpaceChars : RegExprString; //###0.927
\r
528 fWordChars : RegExprString; //###0.929
\r
529 fInvertCase : TRegExprInvertCaseFunction; //###0.927
\r
531 function IsProgrammOk : boolean; //###0.929
\r
533 procedure CheckCompModifiers;
\r
534 // if modifiers was changed after programm compilation - recompile it !
\r
536 function GetExpression : RegExprString;
\r
537 procedure SetExpression (const s : RegExprString);
\r
539 function GetModifierStr : RegExprString;
\r
540 function SetModifiersInt (const AModifiers : RegExprString; var AModifiersInt : integer) : boolean;
\r
541 procedure SetModifierStr (const AModifiers : RegExprString);
\r
543 function GetModifier (AIndex : integer) : boolean;
\r
544 procedure SetModifier (AIndex : integer; ASet : boolean);
\r
546 procedure Error (AErrorID : integer); virtual; // error handler.
\r
547 // Default handler raise exception ERegExpr with
\r
548 // Message = ErrorMsg (AErrorID), ErrorCode = AErrorID
\r
549 // and CompilerErrorPos = value of property CompilerErrorPos.
\r
552 {==================== Compiler section ===================}
\r
553 function CompileRegExpr (exp : PRegExprChar) : boolean;
\r
554 // compile a regular expression into internal code
\r
556 procedure Tail (p : PRegExprChar; val : PRegExprChar);
\r
557 // set the next-pointer at the end of a node chain
\r
559 procedure OpTail (p : PRegExprChar; val : PRegExprChar);
\r
560 // regoptail - regtail on operand of first argument; nop if operandless
\r
562 function EmitNode (op : TREOp) : PRegExprChar;
\r
563 // regnode - emit a node, return location
\r
565 procedure EmitC (b : REChar);
\r
566 // emit (if appropriate) a byte of code
\r
568 procedure InsertOperator (op : TREOp; opnd : PRegExprChar; sz : integer); //###0.90
\r
569 // insert an operator in front of already-emitted operand
\r
570 // Means relocating the operand.
\r
572 function ParseReg (paren : integer; var flagp : integer) : PRegExprChar;
\r
573 // regular expression, i.e. main body or parenthesized thing
\r
575 function ParseBranch (var flagp : integer) : PRegExprChar;
\r
576 // one alternative of an | operator
\r
578 function ParsePiece (var flagp : integer) : PRegExprChar;
\r
579 // something followed by possible [*+?]
\r
581 function ParseAtom (var flagp : integer) : PRegExprChar;
\r
582 // the lowest level
\r
584 function GetCompilerErrorPos : integer;
\r
585 // current pos in r.e. - for error hanling
\r
587 {$IFDEF UseFirstCharSet} //###0.929
\r
588 procedure FillFirstCharSet (prog : PRegExprChar);
\r
591 {===================== Mathing section ===================}
\r
592 function regrepeat (p : PRegExprChar; AMax : integer) : integer;
\r
593 // repeatedly match something simple, report how many
\r
595 function regnext (p : PRegExprChar) : PRegExprChar;
\r
596 // dig the "next" pointer out of a node
\r
598 function MatchPrim (prog : PRegExprChar) : boolean;
\r
599 // recursively matching routine
\r
601 function RegMatch (str : PRegExprChar) : boolean;
\r
602 // try match at specific point, uses MatchPrim for real work
\r
604 function ExecPrim (AOffset: integer) : boolean;
\r
605 // Exec for stored InputString
\r
607 {$IFDEF DebugRegExpr}
\r
608 function DumpOp (op : REChar) : RegExprString;
\r
611 function GetSubExprMatchCount : integer;
\r
612 function GetMatchPos (Idx : integer) : integer;
\r
613 function GetMatchLen (Idx : integer) : integer;
\r
614 function GetMatch (Idx : integer) : RegExprString;
\r
616 function GetInputString : RegExprString;
\r
617 procedure SetInputString (const AInputString : RegExprString);
\r
619 {$IFNDEF UseSetOfChar}
\r
620 function StrScanCI (s : PRegExprChar; ch : REChar) : PRegExprChar; //###0.928
\r
624 constructor Create;
\r
625 destructor Destroy; override;
\r
627 property Expression : RegExprString read GetExpression write SetExpression;
\r
628 // regular expression
\r
629 // When you assign r.e. to this property, TRegExpr will automatically
\r
630 // compile it and store in internal structures.
\r
631 // In case of compilation error, Error method will be called
\r
632 // (by default Error method raises exception ERegExpr - see below)
\r
634 property ModifierStr : RegExprString read GetModifierStr write SetModifierStr;
\r
635 // Set/get default values of r.e.syntax modifiers. Modifiers in
\r
636 // r.e. (?ismx-ismx) will replace this default values.
\r
637 // If you try to set unsupported modifier, Error will be called
\r
638 // (by defaul Error raises exception ERegExpr).
\r
640 property ModifierI : boolean index 1 read GetModifier write SetModifier;
\r
641 // Modifier /i - caseinsensitive, false by default
\r
643 property ModifierR : boolean index 2 read GetModifier write SetModifier;
\r
644 // Modifier /r - use r.e.syntax extended for russian, true by default
\r
645 // (was property ExtSyntaxEnabled in previous versions)
\r
646 // If true, then à-ÿ additional include russian letter '¸',
\r
647 // À-ß additional include '¨', and à-ß include all russian symbols.
\r
648 // You have to turn it off if it may interfere with you national alphabet.
\r
650 property ModifierS : boolean index 3 read GetModifier write SetModifier;
\r
651 // Modifier /s - '.' works as any char (else as [^\n]),
\r
654 function Exec (const AInputString : RegExprString) : boolean;
\r
655 // match a programm against a string AInputString
\r
656 // !!! Exec store AInputString into InputString property
\r
658 function ExecNext : boolean;
\r
659 // find next match:
\r
660 // Exec (AString); ExecNext;
\r
663 // if MatchLen [0] = 0 then ExecPos (MatchPos [0] + 1)
\r
664 // else ExecPos (MatchPos [0] + MatchLen [0]);
\r
665 // but it's more simpler !
\r
667 function ExecPos (AOffset: integer {$IFDEF D4_}= 1{$ENDIF}) : boolean;
\r
668 // find match for InputString starting from AOffset position
\r
669 // (AOffset=1 - first char of InputString)
\r
671 property InputString : RegExprString read GetInputString write SetInputString;
\r
672 // returns current input string (from last Exec call or last assign
\r
673 // to this property).
\r
674 // Any assignment to this property clear Match* properties !
\r
676 function Substitute (const ATemplate : RegExprString) : RegExprString;
\r
677 // Returns ATemplate with '$&' or '$0' replaced by whole r.e.
\r
678 // occurence and '$n' replaced by occurence of subexpression #n.
\r
679 // Since v.0.929 '$' used instead of '\' (for future extensions
\r
680 // and for more Perl-compatibility) and accept more then one digit.
\r
681 // If you want place into template raw '$' or '\', use prefix '\'
\r
682 // Example: '1\$ is $2\\rub\\' -> '1$ is <Match[2]>\rub\'
\r
683 // If you want to place raw digit after '$n' you must delimit
\r
684 // n with curly braces '{}'.
\r
685 // Example: 'a$12bc' -> 'a<Match[12]>bc'
\r
686 // 'a${1}2bc' -> 'a<Match[1]>2bc'.
\r
688 procedure Split (AInputStr : RegExprString; APieces : TStrings);
\r
689 // Split AInputStr into APieces by r.e. occurencies
\r
691 function Replace (AInputStr : RegExprString; const AReplaceStr : RegExprString) : RegExprString;
\r
692 // Returns AInputStr with r.e. occurencies replaced by AReplaceStr
\r
694 property SubExprMatchCount : integer read GetSubExprMatchCount;
\r
695 // Number of subexpressions has been found in last Exec* call.
\r
696 // If there are no subexpr. but whole expr was found (Exec* returned True),
\r
697 // then SubExprMatchCount=0, if no subexpressions nor whole
\r
698 // r.e. found (Exec* returned false) then SubExprMatchCount=-1.
\r
699 // Note, that some subexpr. may be not found and for such
\r
700 // subexpr. MathPos=MatchLen=-1 and Match=''.
\r
701 // For example: Expression := '(1)?2(3)?';
\r
702 // Exec ('123'): SubExprMatchCount=2, Match[0]='123', [1]='1', [2]='3'
\r
703 // Exec ('12'): SubExprMatchCount=1, Match[0]='23', [1]='1'
\r
704 // Exec ('23'): SubExprMatchCount=2, Match[0]='23', [1]='', [2]='3'
\r
705 // Exec ('2'): SubExprMatchCount=0, Match[0]='2'
\r
706 // Exec ('7') - return False: SubExprMatchCount=-1
\r
708 property MatchPos [Idx : integer] : integer read GetMatchPos;
\r
709 // pos of entrance subexpr. #Idx into tested in last Exec*
\r
710 // string. First subexpr. have Idx=1, last - MatchCount,
\r
711 // whole r.e. have Idx=0.
\r
712 // Returns -1 if in r.e. no such subexpr. or this subexpr.
\r
713 // not found in input string.
\r
715 property MatchLen [Idx : integer] : integer read GetMatchLen;
\r
716 // len of entrance subexpr. #Idx r.e. into tested in last Exec*
\r
717 // string. First subexpr. have Idx=1, last - MatchCount,
\r
718 // whole r.e. have Idx=0.
\r
719 // Returns -1 if in r.e. no such subexpr. or this subexpr.
\r
720 // not found in input string.
\r
721 // Remember - MatchLen may be 0 (if r.e. match empty string) !
\r
723 property Match [Idx : integer] : RegExprString read GetMatch;
\r
724 // == copy (InputString, MatchPos [Idx], MatchLen [Idx])
\r
725 // Returns '' if in r.e. no such subexpr. or this subexpr.
\r
726 // not found in input string.
\r
728 function LastError : integer;
\r
729 // Returns ID of last error, 0 if no errors (unusable if
\r
730 // Error method raises exception) and clear internal status
\r
731 // into 0 (no errors).
\r
733 function ErrorMsg (AErrorID : integer) : RegExprString; virtual;
\r
734 // Returns Error message for error with ID = AErrorID.
\r
736 property CompilerErrorPos : integer read GetCompilerErrorPos;
\r
737 // Returns pos in r.e. there compiler stopped.
\r
738 // Usefull for error diagnostics
\r
740 property SpaceChars : RegExprString read fSpaceChars write fSpaceChars; //###0.927
\r
741 // Contains chars, treated as /s (initially filled with RegExprSpaceChars
\r
742 // global constant)
\r
744 property WordChars : RegExprString read fWordChars write fWordChars; //###0.929
\r
745 // Contains chars, treated as /w (initially filled with RegExprWordChars
\r
746 // global constant)
\r
748 class function InvertCaseFunction (const Ch : REChar) : REChar;
\r
749 // Converts Ch into upper case if it in lower case or in lower
\r
750 // if it in upper (uses current system local setings)
\r
752 property InvertCase : TRegExprInvertCaseFunction read fInvertCase write fInvertCase; //##0.935
\r
753 // Set this property if you want to override case-insensitive functionality.
\r
754 // Create set it to RegExprInvertCaseFunction (InvertCaseFunction by default)
\r
756 {$IFDEF DebugRegExpr}
\r
757 function Dump : RegExprString;
\r
758 // dump a compiled regexp in vaguely comprehensible form
\r
762 {** Excepción de expresiones regulares}
\r
763 ERegExpr = class (Exception)
\r
765 ErrorCode : integer;
\r
766 CompilerErrorPos : integer;
\r
770 RegExprInvertCaseFunction : TRegExprInvertCaseFunction = TRegExpr.InvertCaseFunction;
\r
771 // defaul for InvertCase property
\r
773 {** true if string AInputString match regular expression ARegExpr
\r
774 ! will raise exeption if syntax errors in ARegExpr}
\r
775 function ExecRegExpr (const ARegExpr, AInputStr : RegExprString) : boolean;
\r
777 {** Split AInputStr into APieces by r.e. ARegExpr occurencies}
\r
778 procedure SplitRegExpr (const ARegExpr, AInputStr : RegExprString; APieces : TStrings);
\r
780 {** Returns AInputStr with r.e. occurencies replaced by AReplaceStr}
\r
781 function ReplaceRegExpr (const ARegExpr, AInputStr, AReplaceStr : RegExprString) : RegExprString;
\r
783 {** Replace all metachars with its safe representation,
\r
784 for example 'abc$cd.(' converts into 'abc\$cd\.\('
\r
785 This function usefull for r.e. autogeneration from
\r
787 function QuoteRegExprMetaChars (const AStr : RegExprString) : RegExprString;
\r
792 Windows; // CharUpper/Lower
\r
795 MaskModI = 1; // modifier /i bit in fModifiers
\r
796 MaskModR = 2; // -"- /r
\r
797 MaskModS = 4; // -"- /s
\r
799 {=============================================================}
\r
800 {=================== WideString functions ====================}
\r
801 {=============================================================}
\r
805 function StrPCopy (Dest: PRegExprChar; const Source: RegExprString): PRegExprChar;
\r
809 Len := length (Source); //###0.932
\r
810 for i := 1 to Len do
\r
811 Dest [i - 1] := Source [i];
\r
814 end; { of function StrPCopy
\r
815 --------------------------------------------------------------}
\r
817 function StrLCopy (Dest, Source: PRegExprChar; MaxLen: Cardinal): PRegExprChar;
\r
820 for i := 0 to MaxLen - 1 do
\r
821 Dest [i] := Source [i];
\r
823 end; { of function StrLCopy
\r
824 --------------------------------------------------------------}
\r
826 function StrLen (Str: PRegExprChar): Cardinal;
\r
829 while Str [result] <> #0
\r
831 end; { of function StrLen
\r
832 --------------------------------------------------------------}
\r
834 function StrPos (Str1, Str2: PRegExprChar): PRegExprChar;
\r
838 n := Pos (RegExprString (Str2), RegExprString (Str1));
\r
841 Result := Str1 + n - 1;
\r
842 end; { of function StrPos
\r
843 --------------------------------------------------------------}
\r
845 function StrLComp (Str1, Str2: PRegExprChar; MaxLen: Cardinal): Integer;
\r
846 var S1, S2: RegExprString;
\r
850 if Copy (S1, 1, MaxLen) > Copy (S2, 1, MaxLen)
\r
853 if Copy (S1, 1, MaxLen) < Copy (S2, 1, MaxLen)
\r
856 end; { function StrLComp
\r
857 --------------------------------------------------------------}
\r
859 function StrScan (Str: PRegExprChar; Chr: WideChar): PRegExprChar;
\r
862 while (Str^ <> #0) and (Str^ <> Chr)
\r
865 then Result := Str;
\r
866 end; { of function StrScan
\r
867 --------------------------------------------------------------}
\r
871 {=============================================================}
\r
872 {===================== Global functions ======================}
\r
873 {=============================================================}
\r
875 function ExecRegExpr (const ARegExpr, AInputStr : RegExprString) : boolean;
\r
878 r := TRegExpr.Create;
\r
880 r.Expression := ARegExpr;
\r
881 Result := r.Exec (AInputStr);
\r
884 end; { of function ExecRegExpr
\r
885 --------------------------------------------------------------}
\r
887 procedure SplitRegExpr (const ARegExpr, AInputStr : RegExprString; APieces : TStrings);
\r
891 r := TRegExpr.Create;
\r
893 r.Expression := ARegExpr;
\r
894 r.Split (AInputStr, APieces);
\r
897 end; { of procedure SplitRegExpr
\r
898 --------------------------------------------------------------}
\r
900 function ReplaceRegExpr (const ARegExpr, AInputStr, AReplaceStr : RegExprString) : RegExprString;
\r
903 r := TRegExpr.Create;
\r
905 r.Expression := ARegExpr;
\r
906 Result := r.Replace (AInputStr, AReplaceStr);
\r
909 end; { of function ReplaceRegExpr
\r
910 --------------------------------------------------------------}
\r
912 function QuoteRegExprMetaChars (const AStr : RegExprString) : RegExprString;
\r
914 RegExprMetaSet : RegExprString = '^$.[()|?+*\{'
\r
915 + ']}'; // - this last are additional to META.
\r
916 // Very similar to META array, but slighly changed.
\r
917 // !Any changes in META array must be synchronized with this set.
\r
919 i, i0, Len : integer;
\r
922 Len := length (AStr);
\r
925 while i <= Len do begin
\r
926 if Pos (AStr [i], RegExprMetaSet) > 0 then begin
\r
927 Result := Result + System.Copy (AStr, i0, i - i0)
\r
933 Result := Result + System.Copy (AStr, i0, MaxInt); // Tail
\r
934 end; { of function QuoteRegExprMetaChars
\r
935 --------------------------------------------------------------}
\r
940 MAGIC = TREOp (216);// programm signature
\r
942 // name opcode opnd? meaning
\r
943 EEND = TREOp (0); // - End of program
\r
944 BOL = TREOp (1); // - Match "" at beginning of line
\r
945 EOL = TREOp (2); // - Match "" at end of line
\r
946 ANY = TREOp (3); // - Match any one character
\r
947 ANYOF = TREOp (4); // Str Match any character in string Str
\r
948 ANYBUT = TREOp (5); // Str Match any char. not in string Str
\r
949 BRANCH = TREOp (6); // Node Match this alternative, or the next
\r
950 BACK = TREOp (7); // - Jump backward (Next < 0)
\r
951 EXACTLY = TREOp (8); // Str Match string Str
\r
952 NOTHING = TREOp (9); // - Match empty string
\r
953 STAR = TREOp (10); // Node Match this (simple) thing 0 or more times
\r
954 PLUS = TREOp (11); // Node Match this (simple) thing 1 or more times
\r
955 ANYDIGIT = TREOp (12); // - Match any digit (equiv [0-9])
\r
956 NOTDIGIT = TREOp (13); // - Match not digit (equiv [0-9])
\r
957 ANYLETTER = TREOp (14); // - Match any letter from property WordChars
\r
958 NOTLETTER = TREOp (15); // - Match not letter from property WordChars
\r
959 ANYSPACE = TREOp (16); // - Match any space char (see property SpaceChars)
\r
960 NOTSPACE = TREOp (17); // - Match not space char (see property SpaceChars)
\r
961 BRACES = TREOp (18); // Node,Min,Max Match this (simple) thing from Min to Max times.
\r
962 // Min and Max are TREBracesArg
\r
963 COMMENT = TREOp (19); // - Comment ;)
\r
964 EXACTLYCI = TREOp (20); // Str Match string Str case insensitive
\r
965 ANYOFCI = TREOp (21); // Str Match any character in string Str, case insensitive
\r
966 ANYBUTCI = TREOp (22); // Str Match any char. not in string Str, case insensitive
\r
967 LOOPENTRY = TREOp (23); // Node Start of loop (Node - LOOP for this loop)
\r
968 LOOP = TREOp (24); // Node,Min,Max,LoopEntryJmp - back jump for LOOPENTRY.
\r
969 // Min and Max are TREBracesArg
\r
970 // Node - next node in sequence,
\r
971 // LoopEntryJmp - associated LOOPENTRY node addr
\r
972 ANYOFTINYSET= TREOp (25); // Chrs Match any one char from Chrs (exactly TinySetLen chars)
\r
973 ANYBUTTINYSET=TREOp (26); // Chrs Match any one char not in Chrs (exactly TinySetLen chars)
\r
974 ANYOFFULLSET= TREOp (27); // Set Match any one char from set of char
\r
975 // - very fast (one CPU instruction !) but takes 32 bytes of p-code
\r
976 BSUBEXP = TREOp (28); // Idx Match previously matched subexpression #Idx (stored as REChar) //###0.936
\r
977 BSUBEXPCI = TREOp (29); // Idx -"- in case-insensitive mode
\r
980 // !!! Change OPEN value if you add new opcodes !!!
\r
982 OPEN = TREOp (30); // - Mark this point in input as start of \n
\r
983 // OPEN + 1 is \1, etc.
\r
984 CLOSE = TREOp (ord (OPEN) + NSUBEXP);
\r
985 // - Analogous to OPEN.
\r
987 // !!! Don't add new OpCodes after CLOSE !!!
\r
989 // We work with p-code thru pointers, compatible with PRegExprChar.
\r
990 // Note: all code components (TRENextOff, TREOp, TREBracesArg, etc)
\r
991 // must have lengths that can be divided by SizeOf (REChar) !
\r
992 // A node is TREOp of opcode followed Next "pointer" of TRENextOff type.
\r
993 // The Next is a offset from the opcode of the node containing it.
\r
994 // An operand, if any, simply follows the node. (Note that much of
\r
995 // the code generation knows about this implicit relationship!)
\r
996 // Using TRENextOff=integer speed up p-code processing.
\r
998 // Opcodes description:
\r
1000 // BRANCH The set of branches constituting a single choice are hooked
\r
1001 // together with their "next" pointers, since precedence prevents
\r
1002 // anything being concatenated to any individual branch. The
\r
1003 // "next" pointer of the last BRANCH in a choice points to the
\r
1004 // thing following the whole choice. This is also where the
\r
1005 // final "next" pointer of each individual branch points; each
\r
1006 // branch starts with the operand node of a BRANCH node.
\r
1007 // BACK Normal "next" pointers all implicitly point forward; BACK
\r
1008 // exists to make loop structures possible.
\r
1009 // STAR,PLUS,BRACES '?', and complex '*' and '+', are implemented as
\r
1010 // circular BRANCH structures using BACK. Complex '{min,max}'
\r
1011 // - as pair LOOPENTRY-LOOP (see below). Simple cases (one
\r
1012 // character per match) are implemented with STAR, PLUS and
\r
1013 // BRACES for speed and to minimize recursive plunges.
\r
1014 // LOOPENTRY,LOOP {min,max} are implemented as special pair
\r
1015 // LOOPENTRY-LOOP. Each LOOPENTRY initialize loopstack for
\r
1017 // OPEN,CLOSE are numbered at compile time.
\r
1020 {=============================================================}
\r
1021 {================== Error handling section ===================}
\r
1022 {=============================================================}
\r
1026 reeCompNullArgument = 100;
\r
1027 reeCompRegexpTooBig = 101;
\r
1028 reeCompParseRegTooManyBrackets = 102;
\r
1029 reeCompParseRegUnmatchedBrackets = 103;
\r
1030 reeCompParseRegUnmatchedBrackets2 = 104;
\r
1031 reeCompParseRegJunkOnEnd = 105;
\r
1032 reePlusStarOperandCouldBeEmpty = 106;
\r
1033 reeNestedSQP = 107;
\r
1034 reeBadHexDigit = 108;
\r
1035 reeInvalidRange = 109;
\r
1036 reeParseAtomTrailingBackSlash = 110;
\r
1037 reeNoHexCodeAfterBSlashX = 111;
\r
1038 reeHexCodeAfterBSlashXTooBig = 112;
\r
1039 reeUnmatchedSqBrackets = 113;
\r
1040 reeInternalUrp = 114;
\r
1041 reeQPSBFollowsNothing = 115;
\r
1042 reeTrailingBackSlash = 116;
\r
1043 reeRarseAtomInternalDisaster = 119;
\r
1044 reeBRACESArgTooBig = 122;
\r
1045 reeBracesMinParamGreaterMax = 124;
\r
1046 reeUnclosedComment = 125;
\r
1047 reeComplexBracesNotImplemented = 126;
\r
1048 reeUrecognizedModifier = 127;
\r
1049 reeRegRepeatCalledInappropriately = 1000;
\r
1050 reeMatchPrimMemoryCorruption = 1001;
\r
1051 reeMatchPrimCorruptedPointers = 1002;
\r
1052 reeNoExpression = 1003;
\r
1053 reeCorruptedProgram = 1004;
\r
1054 reeNoInpitStringSpecified = 1005;
\r
1055 reeOffsetMustBeGreaterThen0 = 1006;
\r
1056 reeExecNextWithoutExec = 1007;
\r
1057 reeGetInputStringWithoutInputString = 1008;
\r
1058 reeDumpCorruptedOpcode = 1011;
\r
1059 reeExecAfterCompErr = 1012;
\r
1060 reeModifierUnsupported = 1013;
\r
1061 reeLoopStackExceeded = 1014;
\r
1062 reeLoopWithoutEntry = 1015;
\r
1064 function TRegExpr.ErrorMsg (AErrorID : integer) : RegExprString;
\r
1067 reeOk: Result := 'No errors';
\r
1068 reeCompNullArgument: Result := 'TRegExpr(comp): Null Argument';
\r
1069 reeCompRegexpTooBig: Result := 'TRegExpr(comp): Regexp Too Big';
\r
1070 reeCompParseRegTooManyBrackets: Result := 'TRegExpr(comp): ParseReg Too Many ()';
\r
1071 reeCompParseRegUnmatchedBrackets: Result := 'TRegExpr(comp): ParseReg Unmatched ()';
\r
1072 reeCompParseRegUnmatchedBrackets2: Result := 'TRegExpr(comp): ParseReg Unmatched ()';
\r
1073 reeCompParseRegJunkOnEnd: Result := 'TRegExpr(comp): ParseReg Junk On End';
\r
1074 reePlusStarOperandCouldBeEmpty: Result := 'TRegExpr(comp): *+ Operand Could Be Empty';
\r
1075 reeNestedSQP: Result := 'TRegExpr(comp): Nested *?+';
\r
1076 reeBadHexDigit: Result := 'TRegExpr(comp): Bad Hex Digit';
\r
1077 reeInvalidRange: Result := 'TRegExpr(comp): Invalid [] Range';
\r
1078 reeParseAtomTrailingBackSlash: Result := 'TRegExpr(comp): Parse Atom Trailing \';
\r
1079 reeNoHexCodeAfterBSlashX: Result := 'TRegExpr(comp): No Hex Code After \x';
\r
1080 reeHexCodeAfterBSlashXTooBig: Result := 'TRegExpr(comp): Hex Code After \x Is Too Big';
\r
1081 reeUnmatchedSqBrackets: Result := 'TRegExpr(comp): Unmatched []';
\r
1082 reeInternalUrp: Result := 'TRegExpr(comp): Internal Urp';
\r
1083 reeQPSBFollowsNothing: Result := 'TRegExpr(comp): ?+*{ Follows Nothing';
\r
1084 reeTrailingBackSlash: Result := 'TRegExpr(comp): Trailing \';
\r
1085 reeRarseAtomInternalDisaster: Result := 'TRegExpr(comp): RarseAtom Internal Disaster';
\r
1086 reeBRACESArgTooBig: Result := 'TRegExpr(comp): BRACES Argument Too Big';
\r
1087 reeBracesMinParamGreaterMax: Result := 'TRegExpr(comp): BRACE Min Param Greater then Max';
\r
1088 reeUnclosedComment: Result := 'TRegExpr(comp): Unclosed (?#Comment)';
\r
1089 reeComplexBracesNotImplemented: Result := 'TRegExpr(comp): If you want take part in beta-testing BRACES ''{min,max}'' for complex cases - remove ''.'' from {.$DEFINE ComplexBraces}';
\r
1090 reeUrecognizedModifier: Result := 'TRegExpr(comp): Urecognized Modifier';
\r
1092 reeRegRepeatCalledInappropriately: Result := 'TRegExpr(exec): RegRepeat Called Inappropriately';
\r
1093 reeMatchPrimMemoryCorruption: Result := 'TRegExpr(exec): MatchPrim Memory Corruption';
\r
1094 reeMatchPrimCorruptedPointers: Result := 'TRegExpr(exec): MatchPrim Corrupted Pointers';
\r
1095 reeNoExpression: Result := 'TRegExpr(exec): Not Assigned Expression Property';
\r
1096 reeCorruptedProgram: Result := 'TRegExpr(exec): Corrupted Program';
\r
1097 reeNoInpitStringSpecified: Result := 'TRegExpr(exec): No Inpit String Specified';
\r
1098 reeOffsetMustBeGreaterThen0: Result := 'TRegExpr(exec): Offset Must Be Greater Then 0';
\r
1099 reeExecNextWithoutExec: Result := 'TRegExpr(exec): ExecNext Without Exec[Pos]';
\r
1100 reeGetInputStringWithoutInputString: Result := 'TRegExpr(exec): GetInputString Without InputString';
\r
1101 reeDumpCorruptedOpcode: Result := 'TRegExpr(dump): Corrupted Opcode';
\r
1102 reeExecAfterCompErr: Result := 'TRegExpr(exec): Exec After Compilation Error';
\r
1103 reeLoopStackExceeded: Result := 'TRegExpr(exec): Loop Stack Exceeded';
\r
1104 reeLoopWithoutEntry: Result := 'TRegExpr(exec): Loop Without LoopEntry !';
\r
1105 else Result := 'Unknown error';
\r
1107 end; { of procedure TRegExpr.Error
\r
1108 --------------------------------------------------------------}
\r
1110 function TRegExpr.LastError : integer;
\r
1112 Result := fLastError;
\r
1113 fLastError := reeOk;
\r
1114 end; { of function TRegExpr.LastError
\r
1115 --------------------------------------------------------------}
\r
1118 {=============================================================}
\r
1119 {===================== Common section ========================}
\r
1120 {=============================================================}
\r
1122 constructor TRegExpr.Create;
\r
1126 fExpression := nil;
\r
1127 fInputString := nil;
\r
1130 fExprIsCompiled := false;
\r
1132 ModifierI := RegExprModifierI;
\r
1133 ModifierR := RegExprModifierR;
\r
1134 ModifierS := RegExprModifierS;
\r
1136 SpaceChars := RegExprSpaceChars; //###0.927
\r
1137 WordChars := RegExprWordChars; //###0.929
\r
1138 fInvertCase := RegExprInvertCaseFunction; //###0.927
\r
1139 end; { of constructor TRegExpr.Create
\r
1140 --------------------------------------------------------------}
\r
1142 destructor TRegExpr.Destroy;
\r
1144 if programm <> nil
\r
1145 then FreeMem (programm);
\r
1146 if fExpression <> nil
\r
1147 then FreeMem (fExpression);
\r
1148 if fInputString <> nil
\r
1149 then FreeMem (fInputString);
\r
1150 end; { of destructor TRegExpr.Destroy
\r
1151 --------------------------------------------------------------}
\r
1153 class function TRegExpr.InvertCaseFunction (const Ch : REChar) : REChar;
\r
1161 Result := REChar (CharUpper (pointer (Ch)));
\r
1163 then Result := REChar (CharLower (pointer (Ch)));
\r
1165 end; { of function TRegExpr.InvertCaseFunction
\r
1166 --------------------------------------------------------------}
\r
1168 function TRegExpr.GetExpression : RegExprString;
\r
1170 if fExpression <> nil
\r
1171 then Result := fExpression
\r
1172 else Result := '';
\r
1173 end; { of function TRegExpr.GetExpression
\r
1174 --------------------------------------------------------------}
\r
1176 procedure TRegExpr.SetExpression (const s : RegExprString);
\r
1178 if (s <> fExpression) or not fExprIsCompiled then begin
\r
1179 fExprIsCompiled := false;
\r
1180 if fExpression <> nil then begin
\r
1181 FreeMem (fExpression);
\r
1182 fExpression := nil;
\r
1184 if s <> '' then begin
\r
1185 GetMem (fExpression, (length (s) + 1) * SizeOf (REChar));
\r
1186 CompileRegExpr (StrPCopy (fExpression, s));
\r
1189 end; { of procedure TRegExpr.SetExpression
\r
1190 --------------------------------------------------------------}
\r
1192 function TRegExpr.GetSubExprMatchCount : integer;
\r
1194 if Assigned (fInputString) then begin
\r
1195 Result := NSUBEXP - 1;
\r
1196 while (Result > 0) and ((startp [Result] = nil)
\r
1197 or (endp [Result] = nil))
\r
1200 else Result := -1;
\r
1201 end; { of function TRegExpr.GetSubExprMatchCount
\r
1202 --------------------------------------------------------------}
\r
1204 function TRegExpr.GetMatchPos (Idx : integer) : integer;
\r
1206 if (Idx >= 0) and (Idx < NSUBEXP) and Assigned (fInputString)
\r
1207 and Assigned (startp [Idx]) and Assigned (endp [Idx]) then begin
\r
1208 Result := (startp [Idx] - fInputString) + 1;
\r
1210 else Result := -1;
\r
1211 end; { of function TRegExpr.GetMatchPos
\r
1212 --------------------------------------------------------------}
\r
1214 function TRegExpr.GetMatchLen (Idx : integer) : integer;
\r
1216 if (Idx >= 0) and (Idx < NSUBEXP) and Assigned (fInputString)
\r
1217 and Assigned (startp [Idx]) and Assigned (endp [Idx]) then begin
\r
1218 Result := endp [Idx] - startp [Idx];
\r
1220 else Result := -1;
\r
1221 end; { of function TRegExpr.GetMatchLen
\r
1222 --------------------------------------------------------------}
\r
1224 function TRegExpr.GetMatch (Idx : integer) : RegExprString;
\r
1226 if (Idx >= 0) and (Idx < NSUBEXP) and Assigned (fInputString)
\r
1227 and Assigned (startp [Idx]) and Assigned (endp [Idx])
\r
1228 //then Result := copy (fInputString, MatchPos [Idx], MatchLen [Idx]) //###0.929
\r
1229 then SetString (Result, startp [idx], endp [idx] - startp [idx])
\r
1230 else Result := '';
\r
1231 end; { of function TRegExpr.GetMatch
\r
1232 --------------------------------------------------------------}
\r
1234 function TRegExpr.IsProgrammOk : boolean;
\r
1237 if programm = nil // No compiled r.e. present
\r
1238 then Error (reeNoExpression)
\r
1239 else if programm [0] <> MAGIC // Program corrupted.
\r
1240 then Error (reeCorruptedProgram)
\r
1241 else if not fExprIsCompiled // Previous compilation was finished with error
\r
1242 then Error (reeExecAfterCompErr)
\r
1243 else Result := true;
\r
1244 end; { of function TRegExpr.IsProgrammOk
\r
1245 --------------------------------------------------------------}
\r
1247 procedure TRegExpr.CheckCompModifiers;
\r
1249 if (programm <> nil) and (fExpression <> nil)
\r
1250 and (fModifiers <> fProgModifiers)
\r
1251 then CompileRegExpr (fExpression);
\r
1252 end; { of TRegExpr.CheckCompModifiers
\r
1253 --------------------------------------------------------------}
\r
1255 function TRegExpr.GetModifierStr : RegExprString;
\r
1260 then Result := 'i' + Result
\r
1261 else Result := Result + 'i';
\r
1263 then Result := 'r' + Result
\r
1264 else Result := Result + 'r';
\r
1266 then Result := 's' + Result
\r
1267 else Result := Result + 's';
\r
1269 if Result [length (Result)] = '-' // remove '-' if all modifiers are 'On'
\r
1270 then System.Delete (Result, length (Result), 1);
\r
1271 end; { of function TRegExpr.GetModifierStr
\r
1272 --------------------------------------------------------------}
\r
1274 function TRegExpr.SetModifiersInt (const AModifiers : RegExprString; var AModifiersInt : integer) : boolean;
\r
1282 Mask := 0; // strange compiler varning
\r
1283 for i := 1 to length (AModifiers) do
\r
1284 if AModifiers [i] = '-'
\r
1285 then IsOn := false
\r
1287 if Pos (AModifiers [i], 'iI') > 0
\r
1288 then Mask := MaskModI
\r
1289 else if Pos (AModifiers [i], 'rR') > 0
\r
1290 then Mask := MaskModR
\r
1291 else if Pos (AModifiers [i], 'sS') > 0
\r
1292 then Mask := MaskModS
\r
1298 then AModifiersInt := AModifiersInt or Mask
\r
1299 else AModifiersInt := AModifiersInt and not Mask;
\r
1301 end; { of function TRegExpr.SetModifiersInt
\r
1302 --------------------------------------------------------------}
\r
1304 procedure TRegExpr.SetModifierStr (const AModifiers : RegExprString);
\r
1306 if not SetModifiersInt (AModifiers, fModifiers)
\r
1307 then Error (reeModifierUnsupported);
\r
1308 CheckCompModifiers;
\r
1309 end; { of procedure TRegExpr.SetModifierStr
\r
1310 --------------------------------------------------------------}
\r
1312 function TRegExpr.GetModifier (AIndex : integer) : boolean;
\r
1318 1: Mask := MaskModI;
\r
1319 2: Mask := MaskModR;
\r
1320 3: Mask := MaskModS;
\r
1322 Error (reeModifierUnsupported);
\r
1326 Result := (fModifiers and Mask) = Mask;
\r
1327 end; { of function TRegExpr.GetModifier
\r
1328 --------------------------------------------------------------}
\r
1330 procedure TRegExpr.SetModifier (AIndex : integer; ASet : boolean);
\r
1335 1: Mask := MaskModI;
\r
1336 2: Mask := MaskModR;
\r
1337 3: Mask := MaskModS;
\r
1339 Error (reeModifierUnsupported);
\r
1344 then fModifiers := fModifiers or Mask
\r
1345 else fModifiers := fModifiers and not Mask;
\r
1346 CheckCompModifiers;
\r
1347 end; { of procedure TRegExpr.SetModifier
\r
1348 --------------------------------------------------------------}
\r
1351 {=============================================================}
\r
1352 {==================== Compiler section =======================}
\r
1353 {=============================================================}
\r
1355 procedure TRegExpr.Tail (p : PRegExprChar; val : PRegExprChar);
\r
1356 // set the next-pointer at the end of a node chain
\r
1358 scan : PRegExprChar;
\r
1359 temp : PRegExprChar;
\r
1363 // Find last node.
\r
1366 temp := regnext (scan);
\r
1371 // Set Next 'pointer'
\r
1372 PRENextOff (scan + REOpSz)^ := val - scan; //###0.933
\r
1373 end; { of procedure TRegExpr.Tail
\r
1374 --------------------------------------------------------------}
\r
1376 procedure TRegExpr.OpTail (p : PRegExprChar; val : PRegExprChar);
\r
1377 // regtail on operand of first argument; nop if operandless
\r
1379 // "Operandless" and "op != BRANCH" are synonymous in practice.
\r
1380 if (p = nil) or (p = @regdummy) or (PREOp (p)^ <> BRANCH)
\r
1382 Tail (p + REOpSz + RENextOffSz, val); //###0.933
\r
1383 end; { of procedure TRegExpr.OpTail
\r
1384 --------------------------------------------------------------}
\r
1386 function TRegExpr.EmitNode (op : TREOp) : PRegExprChar; //###0.933
\r
1387 // emit a node, return location
\r
1389 Result := regcode;
\r
1390 if Result <> @regdummy then begin
\r
1391 PREOp (regcode)^ := op;
\r
1392 inc (regcode, REOpSz);
\r
1393 PRENextOff (regcode)^ := 0; // Next "pointer" := nil
\r
1394 inc (regcode, RENextOffSz);
\r
1396 else inc (regsize, REOpSz + RENextOffSz); // compute code size without code generation
\r
1397 end; { of function TRegExpr.EmitNode
\r
1398 --------------------------------------------------------------}
\r
1400 procedure TRegExpr.EmitC (b : REChar);
\r
1401 // emit a byte to code
\r
1403 if regcode <> @regdummy then begin
\r
1407 else inc (regsize); // Type of p-code pointer always is ^REChar
\r
1408 end; { of procedure TRegExpr.EmitC
\r
1409 --------------------------------------------------------------}
\r
1411 procedure TRegExpr.InsertOperator (op : TREOp; opnd : PRegExprChar; sz : integer);
\r
1412 // insert an operator in front of already-emitted operand
\r
1413 // Means relocating the operand.
\r
1415 src, dst, place : PRegExprChar;
\r
1418 if regcode = @regdummy then begin
\r
1419 inc (regsize, sz);
\r
1423 inc (regcode, sz);
\r
1425 while src > opnd do begin
\r
1430 place := opnd; // Op node, where operand used to be.
\r
1431 PREOp (place)^ := op;
\r
1432 inc (place, REOpSz);
\r
1433 for i := 1 + REOpSz to sz do begin
\r
1437 end; { of procedure TRegExpr.InsertOperator
\r
1438 --------------------------------------------------------------}
\r
1440 function strcspn (s1 : PRegExprChar; s2 : PRegExprChar) : integer;
\r
1441 // find length of initial segment of s1 consisting
\r
1442 // entirely of characters not from s2
\r
1443 var scan1, scan2 : PRegExprChar;
\r
1447 while scan1^ <> #0 do begin
\r
1449 while scan2^ <> #0 do
\r
1450 if scan1^ = scan2^
\r
1456 end; { of function strcspn
\r
1457 --------------------------------------------------------------}
\r
1460 // Flags to be passed up and down.
\r
1461 HASWIDTH = 01; // Known never to match nil string.
\r
1462 SIMPLE = 02; // Simple enough to be STAR/PLUS/BRACES operand.
\r
1463 SPSTART = 04; // Starts with * or +.
\r
1464 WORST = 0; // Worst case.
\r
1465 META : array [0 .. 12] of REChar = (
\r
1466 '^', '$', '.', '[', '(', ')', '|', '?', '+', '*', '\', '{', #0);
\r
1467 // Any modification must be synchronized with QuoteRegExprMetaChars !!!
\r
1470 RusRangeLo : array [0 .. 33] of REChar =
\r
1471 (#$430,#$431,#$432,#$433,#$434,#$435,#$451,#$436,#$437,
\r
1472 #$438,#$439,#$43A,#$43B,#$43C,#$43D,#$43E,#$43F,
\r
1473 #$440,#$441,#$442,#$443,#$444,#$445,#$446,#$447,
\r
1474 #$448,#$449,#$44A,#$44B,#$44C,#$44D,#$44E,#$44F,#0);
\r
1475 RusRangeHi : array [0 .. 33] of REChar =
\r
1476 (#$410,#$411,#$412,#$413,#$414,#$415,#$401,#$416,#$417,
\r
1477 #$418,#$419,#$41A,#$41B,#$41C,#$41D,#$41E,#$41F,
\r
1478 #$420,#$421,#$422,#$423,#$424,#$425,#$426,#$427,
\r
1479 #$428,#$429,#$42A,#$42B,#$42C,#$42D,#$42E,#$42F,#0);
\r
1480 RusRangeLoLow = #$430{'à'};
\r
1481 RusRangeLoHigh = #$44F{'ÿ'};
\r
1482 RusRangeHiLow = #$410{'À'};
\r
1483 RusRangeHiHigh = #$42F{'ß'};
\r
1485 RusRangeLo = 'àáâãäå¸æçèéêëìíîïðñòóôõö÷øùúûüýþÿ';
\r
1486 RusRangeHi = 'ÀÁÂÃÄŨÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞß';
\r
1487 RusRangeLoLow = 'à';
\r
1488 RusRangeLoHigh = 'ÿ';
\r
1489 RusRangeHiLow = 'À';
\r
1490 RusRangeHiHigh = 'ß';
\r
1493 function TRegExpr.CompileRegExpr (exp : PRegExprChar) : boolean;
\r
1494 // compile a regular expression into internal code
\r
1495 // We can't allocate space until we know how big the compiled form will be,
\r
1496 // but we can't compile it (and thus know how big it is) until we've got a
\r
1497 // place to put the code. So we cheat: we compile it twice, once with code
\r
1498 // generation turned off and size counting turned on, and once "for real".
\r
1499 // This also means that we don't allocate space until we are sure that the
\r
1500 // thing really will compile successfully, and we never have to move the
\r
1501 // code and thus invalidate pointers into it. (Note that it has to be in
\r
1502 // one piece because free() must be able to free it all.)
\r
1503 // Beware that the optimization-preparation code in here knows about some
\r
1504 // of the structure of the compiled regexp.
\r
1506 scan, longest : PRegExprChar;
\r
1510 Result := false; // life too dark
\r
1511 fExprIsCompiled := false;
\r
1513 regparse := nil; // for correct error handling
\r
1515 try // must clear regexpbeg after compilation
\r
1517 if programm <> nil then begin
\r
1518 FreeMem (programm);
\r
1522 if exp = nil then begin
\r
1523 Error (reeCompNullArgument);
\r
1527 fProgModifiers := fModifiers;
\r
1528 // well, may it's paranoia. I'll check it later... !!!!!!!!
\r
1530 // First pass: determine size, legality.
\r
1531 fCompModifiers := fModifiers;
\r
1535 regcode := @regdummy;
\r
1537 if ParseReg (0, flags) = nil
\r
1540 // Small enough for 2-bytes programm pointers ?
\r
1541 // ###0.933 no real p-code length limits now :)))
\r
1542 // if regsize >= 64 * 1024 then begin
\r
1543 // Error (reeCompRegexpTooBig);
\r
1547 // Allocate space.
\r
1548 GetMem (programm, regsize * SizeOf (REChar));
\r
1550 // Second pass: emit code.
\r
1551 fCompModifiers := fModifiers;
\r
1554 regcode := programm;
\r
1556 if ParseReg (0, flags) = nil
\r
1559 // Dig out information for optimizations.
\r
1560 {$IFDEF UseFirstCharSet} //###0.929
\r
1561 FirstCharSet := [];
\r
1562 FillFirstCharSet (programm + REOpSz);
\r
1564 regstart := #0; // Worst-case defaults.
\r
1568 scan := programm + REOpSz; // First BRANCH.
\r
1569 if PREOp (regnext (scan))^ = EEND then begin // Only one top-level choice.
\r
1570 scan := scan + REOpSz + RENextOffSz;
\r
1572 // Starting-point info.
\r
1573 if PREOp (scan)^ = EXACTLY
\r
1574 then regstart := (scan + REOpSz + RENextOffSz)^
\r
1575 else if PREOp (scan)^ = BOL
\r
1576 then inc (reganch);
\r
1578 // If there's something expensive in the r.e., find the longest
\r
1579 // literal string that must appear and make it the regmust. Resolve
\r
1580 // ties in favor of later strings, since the regstart check works
\r
1581 // with the beginning of the r.e. and avoiding duplication
\r
1582 // strengthens checking. Not a strong reason, but sufficient in the
\r
1583 // absence of others.
\r
1584 if (flags and SPSTART) <> 0 then begin
\r
1587 while scan <> nil do begin
\r
1588 if (PREOp (scan)^ = EXACTLY)
\r
1589 and (strlen (scan + REOpSz + RENextOffSz) >= len) then begin
\r
1590 longest := scan + REOpSz + RENextOffSz;
\r
1591 len := strlen (longest);
\r
1593 scan := regnext (scan);
\r
1595 regmust := longest;
\r
1600 finally regexpbeg := nil;
\r
1603 fExprIsCompiled := true;
\r
1605 end; { of function TRegExpr.CompileRegExpr
\r
1606 --------------------------------------------------------------}
\r
1608 function TRegExpr.ParseReg (paren : integer; var flagp : integer) : PRegExprChar;
\r
1609 // regular expression, i.e. main body or parenthesized thing
\r
1610 // Caller must absorb opening parenthesis.
\r
1611 // Combining parenthesis handling with the base level of regular expression
\r
1612 // is a trifle forced, but the need to tie the tails of the branches to what
\r
1613 // follows makes it hard to avoid.
\r
1615 ret, br, ender : PRegExprChar;
\r
1618 SavedModifiers : integer;
\r
1621 flagp := HASWIDTH; // Tentatively.
\r
1622 parno := 0; // eliminate compiler stupid warning
\r
1623 SavedModifiers := fCompModifiers;
\r
1625 // Make an OPEN node, if parenthesized.
\r
1626 if paren <> 0 then begin
\r
1627 if regnpar >= NSUBEXP then begin
\r
1628 Error (reeCompParseRegTooManyBrackets);
\r
1633 ret := EmitNode (TREOp (ord (OPEN) + parno));
\r
1637 // Pick up the branches, linking them together.
\r
1638 br := ParseBranch (flags);
\r
1639 if br = nil then begin
\r
1644 then Tail (ret, br) // OPEN -> first.
\r
1646 if (flags and HASWIDTH) = 0
\r
1647 then flagp := flagp and not HASWIDTH;
\r
1648 flagp := flagp or flags and SPSTART;
\r
1649 while (regparse^ = '|') do begin
\r
1651 br := ParseBranch (flags);
\r
1652 if br = nil then begin
\r
1656 Tail (ret, br); // BRANCH -> BRANCH.
\r
1657 if (flags and HASWIDTH) = 0
\r
1658 then flagp := flagp and not HASWIDTH;
\r
1659 flagp := flagp or flags and SPSTART;
\r
1662 // Make a closing node, and hook it on the end.
\r
1664 then ender := EmitNode (TREOp (ord (CLOSE) + parno))
\r
1665 else ender := EmitNode (EEND);
\r
1666 Tail (ret, ender);
\r
1668 // Hook the tails of the branches to the closing node.
\r
1670 while br <> nil do begin
\r
1671 OpTail (br, ender);
\r
1672 br := regnext (br);
\r
1675 // Check for proper termination.
\r
1676 if paren <> 0 then
\r
1677 if regparse^ <> ')' then begin
\r
1678 Error (reeCompParseRegUnmatchedBrackets);
\r
1681 else inc (regparse); // skip trailing ')'
\r
1682 if (paren = 0) and (regparse^ <> #0) then begin
\r
1683 if regparse^ = ')'
\r
1684 then Error (reeCompParseRegUnmatchedBrackets2)
\r
1685 else Error (reeCompParseRegJunkOnEnd);
\r
1688 fCompModifiers := SavedModifiers; // restore modifiers of parent
\r
1690 end; { of function TRegExpr.ParseReg
\r
1691 --------------------------------------------------------------}
\r
1693 function TRegExpr.ParseBranch (var flagp : integer) : PRegExprChar;
\r
1694 // one alternative of an | operator
\r
1695 // Implements the concatenation operator.
\r
1697 ret, chain, latest : PRegExprChar;
\r
1700 flagp := WORST; // Tentatively.
\r
1702 ret := EmitNode (BRANCH);
\r
1704 while (regparse^ <> #0) and (regparse^ <> '|')
\r
1705 and (regparse^ <> ')') do begin
\r
1706 latest := ParsePiece (flags);
\r
1707 if latest = nil then begin
\r
1711 flagp := flagp or flags and HASWIDTH;
\r
1712 if chain = nil // First piece.
\r
1713 then flagp := flagp or flags and SPSTART
\r
1714 else Tail (chain, latest);
\r
1717 if chain = nil // Loop ran zero times.
\r
1718 then EmitNode (NOTHING);
\r
1720 end; { of function TRegExpr.ParseBranch
\r
1721 --------------------------------------------------------------}
\r
1723 function TRegExpr.ParsePiece (var flagp : integer) : PRegExprChar;
\r
1724 // something followed by possible [*+?{]
\r
1725 // Note that the branching code sequences used for ? and the general cases
\r
1726 // of * and + and { are somewhat optimized: they use the same NOTHING node as
\r
1727 // both the endmarker for their branch list and the body of the last branch.
\r
1728 // It might seem that this node could be dispensed with entirely, but the
\r
1729 // endmarker role is not redundant.
\r
1730 function parsenum (AStart, AEnd : PRegExprChar) : TREBracesArg;
\r
1733 if AEnd - AStart + 1 > 8 then begin // prevent stupid scanning
\r
1734 Error (reeBRACESArgTooBig);
\r
1737 while AStart <= AEnd do begin
\r
1738 Result := Result * 10 + (ord (AStart^) - ord ('0'));
\r
1741 if (Result > MaxBracesArg) or (Result < 0) then begin
\r
1742 Error (reeBRACESArgTooBig);
\r
1748 NextNode : PRegExprChar;
\r
1750 BracesMin, Bracesmax : TREBracesArg;
\r
1751 p, savedparse : PRegExprChar;
\r
1752 {$IFDEF ComplexBraces}
\r
1756 Result := ParseAtom (flags);
\r
1761 if not ((op = '*') or (op = '+') or (op = '?') or (op = '{')) then begin
\r
1765 if ((flags and HASWIDTH) = 0) and (op <> '?') then begin
\r
1766 Error (reePlusStarOperandCouldBeEmpty);
\r
1772 flagp := WORST or SPSTART;
\r
1773 if (flags and SIMPLE) = 0 then begin
\r
1774 // Emit x* as (x&|), where & means "self".
\r
1775 InsertOperator (BRANCH, Result, REOpSz + RENextOffSz); // Either x
\r
1776 OpTail (Result, EmitNode (BACK)); // and loop
\r
1777 OpTail (Result, Result); // back
\r
1778 Tail (Result, EmitNode (BRANCH)); // or
\r
1779 Tail (Result, EmitNode (NOTHING)); // nil.
\r
1781 else InsertOperator (STAR, Result, REOpSz + RENextOffSz);
\r
1782 end; { of case '*'}
\r
1784 flagp := WORST or SPSTART or HASWIDTH;
\r
1785 if (flags and SIMPLE) = 0 then begin
\r
1786 // Emit x+ as x(&|), where & means "self".
\r
1787 NextNode := EmitNode (BRANCH); // Either
\r
1788 Tail (Result, NextNode);
\r
1789 Tail (EmitNode (BACK), Result); // loop back
\r
1790 Tail (NextNode, EmitNode (BRANCH)); // or
\r
1791 Tail (Result, EmitNode (NOTHING)); // nil.
\r
1793 else InsertOperator (PLUS, Result, REOpSz + RENextOffSz);
\r
1794 end; { of case '+'}
\r
1797 // Emit x? as (x|)
\r
1798 InsertOperator (BRANCH, Result, REOpSz + RENextOffSz); // Either x
\r
1799 Tail (Result, EmitNode (BRANCH)); // or
\r
1800 NextNode := EmitNode (NOTHING); // nil.
\r
1801 Tail (Result, NextNode);
\r
1802 OpTail (Result, NextNode);
\r
1803 end; { of case '?'}
\r
1805 savedparse := regparse;
\r
1808 while Pos (regparse^, '0123456789') > 0 // <min> MUST appear
\r
1809 do inc (regparse);
\r
1810 if (regparse^ <> '}') and (regparse^ <> ',') or (p = regparse) then begin
\r
1811 regparse := savedparse;
\r
1815 BracesMin := parsenum (p, regparse - 1);
\r
1816 if regparse^ = ',' then begin
\r
1819 while Pos (regparse^, '0123456789') > 0
\r
1820 do inc (regparse);
\r
1821 if regparse^ <> '}' then begin
\r
1822 regparse := savedparse;
\r
1826 then BracesMax := MaxBracesArg
\r
1827 else BracesMax := parsenum (p, regparse - 1);
\r
1829 else BracesMax := BracesMin; // {n} == {n,n}
\r
1830 if BracesMin > BracesMax then begin
\r
1831 Error (reeBracesMinParamGreaterMax);
\r
1835 then flagp := WORST;
\r
1837 then flagp := flagp or HASWIDTH or SPSTART;
\r
1838 if (flags and SIMPLE) <> 0 then begin
\r
1839 InsertOperator (BRACES, Result, REOpSz + RENextOffSz + REBracesArgSz * 2);
\r
1840 if regcode <> @regdummy then begin
\r
1841 PREBracesArg (Result + REOpSz + RENextOffSz)^ := BracesMin;
\r
1842 PREBracesArg (Result + REOpSz + RENextOffSz + REBracesArgSz)^ := BracesMax;
\r
1845 else begin // Emit complex x{min,max}
\r
1846 {$IFNDEF ComplexBraces}
\r
1847 Error (reeComplexBracesNotImplemented);
\r
1850 InsertOperator (LOOPENTRY, Result, REOpSz + RENextOffSz);
\r
1851 NextNode := EmitNode (LOOP);
\r
1852 if regcode <> @regdummy then begin
\r
1853 off := (Result + REOpSz + RENextOffSz)
\r
1854 - (regcode - REOpSz - RENextOffSz); // back to Atom after LOOPENTRY
\r
1855 PREBracesArg (regcode)^ := BracesMin;
\r
1856 inc (regcode, REBracesArgSz);
\r
1857 PREBracesArg (regcode)^ := BracesMax;
\r
1858 inc (regcode, REBracesArgSz);
\r
1859 PRENextOff (regcode)^ := off;
\r
1860 inc (regcode, RENextOffSz);
\r
1862 else inc (regsize, REBracesArgSz * 2 + RENextOffSz);
\r
1863 Tail (Result, NextNode); // LOOPENTRY -> LOOP
\r
1864 if regcode <> @regdummy then
\r
1865 Tail (Result + REOpSz + RENextOffSz, NextNode); // Atom -> LOOP
\r
1868 end; { of case '{'}
\r
1869 // else // here we can't be
\r
1870 end; { of case op}
\r
1873 if (regparse^ = '*') or (regparse^ = '+') or (regparse^ = '?') or (regparse^ = '{') then begin
\r
1874 Error (reeNestedSQP);
\r
1877 end; { of function TRegExpr.ParsePiece
\r
1878 --------------------------------------------------------------}
\r
1880 function TRegExpr.ParseAtom (var flagp : integer) : PRegExprChar;
\r
1881 // the lowest level
\r
1882 // Optimization: gobbles an entire sequence of ordinary characters so that
\r
1883 // it can turn them into a single node, which is smaller to store and
\r
1884 // faster to run. Backslashed characters are exceptions, each becoming a
\r
1885 // separate node; the code is simpler that way and it's not worth fixing.
\r
1887 ret : PRegExprChar;
\r
1889 RangeBeg, RangeEnd : REChar;
\r
1890 CanBeRange : boolean;
\r
1893 begmodfs : PRegExprChar;
\r
1895 {$IFDEF UseSetOfChar} //###0.930
\r
1896 RangePCodeBeg : PRegExprChar;
\r
1897 RangePCodeIdx : integer;
\r
1898 RangeIsCI : boolean;
\r
1899 RangeSet : TSetOfREChar;
\r
1900 RangeLen : integer;
\r
1901 RangeChMin, RangeChMax : REChar;
\r
1904 procedure EmitExactly (ch : REChar);
\r
1906 if (fCompModifiers and MaskModI) = MaskModI
\r
1907 then ret := EmitNode (EXACTLYCI)
\r
1908 else ret := EmitNode (EXACTLY);
\r
1911 flagp := flagp or HASWIDTH or SIMPLE;
\r
1914 procedure EmitStr (const s : RegExprString);
\r
1917 for i := 1 to length (s)
\r
1921 function HexDig (ch : REChar) : integer;
\r
1924 if (ch >= 'a') and (ch <= 'f')
\r
1925 then ch := REChar (ord (ch) - (ord ('a') - ord ('A')));
\r
1926 if (ch < '0') or (ch > 'F') or ((ch > '9') and (ch < 'A')) then begin
\r
1927 Error (reeBadHexDigit);
\r
1930 Result := ord (ch) - ord ('0');
\r
1932 then Result := Result - (ord ('A') - ord ('9') - 1);
\r
1935 function EmitRange (AOpCode : REChar) : PRegExprChar;
\r
1937 {$IFDEF UseSetOfChar}
\r
1940 Result := EmitNode (ANYBUTTINYSET);
\r
1941 else // ANYOFCI, ANYOF
\r
1942 Result := EmitNode (ANYOFTINYSET);
\r
1945 ANYBUTCI, ANYOFCI:
\r
1946 RangeIsCI := True;
\r
1947 else // ANYBUT, ANYOF
\r
1948 RangeIsCI := False;
\r
1950 RangePCodeBeg := regcode;
\r
1951 RangePCodeIdx := regsize;
\r
1954 RangeChMin := #255;
\r
1957 Result := EmitNode (AOpCode);
\r
1959 // !!!!!!!!!!!!! Implement ANYOF[BUT]TINYSET generation for UniCode !!!!!!!!!!
\r
1963 {$IFDEF UseSetOfChar}
\r
1964 procedure EmitRangeCPrim (b : REChar); //###0.930
\r
1970 then RangeChMin := b;
\r
1972 then RangeChMax := b;
\r
1973 Include (RangeSet, b);
\r
1977 procedure EmitRangeC (b : REChar);
\r
1978 {$IFDEF UseSetOfChar}
\r
1983 CanBeRange := false;
\r
1984 {$IFDEF UseSetOfChar}
\r
1985 if b <> #0 then begin
\r
1986 EmitRangeCPrim (b); //###0.930
\r
1988 then EmitRangeCPrim (InvertCase (b)); //###0.930
\r
1991 Assert (RangeLen > 0, 'TRegExpr.ParseAtom(subroutine EmitRangeC): empty range'); // impossible, but who knows..
\r
1992 Assert (RangeChMin <= RangeChMax, 'TRegExpr.ParseAtom(subroutine EmitRangeC): RangeChMin > RangeChMax'); // impossible, but who knows..
\r
1993 if RangeLen <= TinySetLen then begin // emit "tiny set"
\r
1994 if regcode = @regdummy then begin
\r
1995 regsize := RangePCodeIdx + TinySetLen; // RangeChMin/Max !!!
\r
1998 regcode := RangePCodeBeg;
\r
1999 for Ch := RangeChMin to RangeChMax do //###0.930
\r
2000 if Ch in RangeSet then begin
\r
2005 while regcode < RangePCodeBeg + TinySetLen do begin
\r
2006 regcode^ := RangeChMax;
\r
2011 if regcode = @regdummy then begin
\r
2012 regsize := RangePCodeIdx + SizeOf (TSetOfREChar);
\r
2015 if (RangePCodeBeg - REOpSz - RENextOffSz)^ = ANYBUTTINYSET
\r
2016 then RangeSet := [#0 .. #255] - RangeSet;
\r
2017 PREOp (RangePCodeBeg - REOpSz - RENextOffSz)^ := ANYOFFULLSET;
\r
2018 regcode := RangePCodeBeg;
\r
2019 Move (RangeSet, regcode^, SizeOf (TSetOfREChar));
\r
2020 inc (regcode, SizeOf (TSetOfREChar));
\r
2028 procedure EmitSimpleRangeC (b : REChar);
\r
2032 CanBeRange := true;
\r
2035 procedure EmitRangeStr (const s : RegExprString);
\r
2038 for i := 1 to length (s)
\r
2039 do EmitRangeC (s [i]);
\r
2042 function UnQuoteChar (var APtr : PRegExprChar) : REChar; //###0.934
\r
2045 't': Result := #$9; // tab (HT/TAB)
\r
2046 'n': Result := #$a; // newline (NL)
\r
2047 'r': Result := #$d; // car.return (CR)
\r
2048 'f': Result := #$c; // form feed (FF)
\r
2049 'a': Result := #$7; // alarm (bell) (BEL)
\r
2050 'e': Result := #$1b; // escape (ESC)
\r
2051 'x': begin // hex char
\r
2054 if APtr^ = #0 then begin
\r
2055 Error (reeNoHexCodeAfterBSlashX);
\r
2058 if APtr^ = '{' then begin // \x{nnnn} //###0.936
\r
2061 if APtr^ = #0 then begin
\r
2062 Error (reeNoHexCodeAfterBSlashX);
\r
2065 if APtr^ <> '}' then begin
\r
2067 ShR (SizeOf (REChar) * 8 - 4)) and $F <> 0 then begin
\r
2068 Error (reeHexCodeAfterBSlashXTooBig);
\r
2071 Result := REChar ((Ord (Result) ShL 4) or HexDig (APtr^));
\r
2072 // HexDig will cause Error if bad hex digit found
\r
2078 Result := REChar (HexDig (APtr^));
\r
2079 // HexDig will cause Error if bad hex digit found
\r
2081 if APtr^ = #0 then begin
\r
2082 Error (reeNoHexCodeAfterBSlashX);
\r
2085 Result := REChar ((Ord (Result) ShL 4) or HexDig (APtr^));
\r
2086 // HexDig will cause Error if bad hex digit found
\r
2089 else Result := APtr^;
\r
2095 flagp := WORST; // Tentatively.
\r
2098 case (regparse - 1)^ of
\r
2099 '^': ret := EmitNode (BOL);
\r
2100 '$': ret := EmitNode (EOL);
\r
2102 if (fCompModifiers and MaskModS) = MaskModS then begin
\r
2103 ret := EmitNode (ANY);
\r
2104 flagp := flagp or HASWIDTH or SIMPLE;
\r
2106 else begin // not /s, so emit [^\n]
\r
2107 ret := EmitRange (ANYBUT);
\r
2108 EmitRangeStr (#$a);
\r
2110 flagp := flagp or HASWIDTH or SIMPLE;
\r
2113 if regparse^ = '^' then begin // Complement of range.
\r
2114 if (fCompModifiers and MaskModI) = MaskModI
\r
2115 then ret := EmitRange (ANYBUTCI)
\r
2116 else ret := EmitRange (ANYBUT);
\r
2120 if (fCompModifiers and MaskModI) = MaskModI
\r
2121 then ret := EmitRange (ANYOFCI)
\r
2122 else ret := EmitRange (ANYOF);
\r
2124 CanBeRange := false;
\r
2126 if (regparse^ = ']') then begin
\r
2127 EmitSimpleRangeC (regparse^); // []-a] -> ']' .. 'a'
\r
2131 while (regparse^ <> #0) and (regparse^ <> ']') do begin
\r
2132 if (regparse^ = '-')
\r
2133 and ((regparse + 1)^ <> #0) and ((regparse + 1)^ <> ']')
\r
2134 and CanBeRange then begin
\r
2136 RangeEnd := regparse^;
\r
2137 if RangeEnd = '\' then begin
\r
2138 {$IFDEF UniCode} //###0.935
\r
2139 if (ord ((regparse + 1)^) < 256)
\r
2140 and (char ((regparse + 1)^)
\r
2141 in ['d', 'D', 's', 'S', 'w', 'W']) then begin
\r
2143 if (regparse + 1)^ in ['d', 'D', 's', 'S', 'w', 'W'] then begin
\r
2145 EmitRangeC ('-'); // or treat as error ?!!
\r
2149 RangeEnd := UnQuoteChar (regparse);
\r
2152 // r.e.ranges extension for russian
\r
2153 if ((fCompModifiers and MaskModR) = MaskModR)
\r
2154 and (RangeBeg = RusRangeLoLow) and (RangeEnd = RusRangeLoHigh) then begin
\r
2155 EmitRangeStr (RusRangeLo);
\r
2157 else if ((fCompModifiers and MaskModR) = MaskModR)
\r
2158 and (RangeBeg = RusRangeHiLow) and (RangeEnd = RusRangeHiHigh) then begin
\r
2159 EmitRangeStr (RusRangeHi);
\r
2161 else if ((fCompModifiers and MaskModR) = MaskModR)
\r
2162 and (RangeBeg = RusRangeLoLow) and (RangeEnd = RusRangeHiHigh) then begin
\r
2163 EmitRangeStr (RusRangeLo);
\r
2164 EmitRangeStr (RusRangeHi);
\r
2166 else begin // standard r.e. handling
\r
2167 if RangeBeg > RangeEnd then begin
\r
2168 Error (reeInvalidRange);
\r
2172 EmitRangeC (RangeEnd); // prevent infinite loop if RangeEnd=$ff
\r
2173 while RangeBeg < RangeEnd do begin //###0.929
\r
2174 EmitRangeC (RangeBeg);
\r
2181 if regparse^ = '\' then begin
\r
2183 if regparse^ = #0 then begin
\r
2184 Error (reeParseAtomTrailingBackSlash);
\r
2187 case regparse^ of // r.e.extensions
\r
2188 'd': EmitRangeStr ('0123456789');
\r
2189 'w': EmitRangeStr (WordChars);
\r
2190 's': EmitRangeStr (SpaceChars);
\r
2191 else EmitSimpleRangeC (UnQuoteChar (regparse));
\r
2194 else EmitSimpleRangeC (regparse^);
\r
2199 if regparse^ <> ']' then begin
\r
2200 Error (reeUnmatchedSqBrackets);
\r
2204 flagp := flagp or HASWIDTH or SIMPLE;
\r
2207 if regparse^ = '?' then begin
\r
2208 // check for extended Perl syntax : (?..)
\r
2209 if (regparse + 1)^ = '#' then begin // (?#comment)
\r
2210 inc (regparse, 2); // find closing ')'
\r
2211 while (regparse^ <> #0) and (regparse^ <> ')')
\r
2212 do inc (regparse);
\r
2213 if regparse^ <> ')' then begin
\r
2214 Error (reeUnclosedComment);
\r
2217 inc (regparse); // skip ')'
\r
2218 ret := EmitNode (COMMENT); // comment
\r
2220 else begin // modifiers ?
\r
2221 inc (regparse); // skip '?'
\r
2222 begmodfs := regparse;
\r
2223 while (regparse^ <> #0) and (regparse^ <> ')')
\r
2224 do inc (regparse);
\r
2225 if (regparse^ <> ')')
\r
2226 or not SetModifiersInt (copy (begmodfs, 1, (regparse - begmodfs)), fCompModifiers) then begin
\r
2227 Error (reeUrecognizedModifier);
\r
2230 inc (regparse); // skip ')'
\r
2231 ret := EmitNode (COMMENT); // comment
\r
2232 // Error (reeQPSBFollowsNothing);
\r
2237 ret := ParseReg (1, flags);
\r
2238 if ret = nil then begin
\r
2242 flagp := flagp or flags and (HASWIDTH or SPSTART);
\r
2245 #0, '|', ')': begin // Supposed to be caught earlier.
\r
2246 Error (reeInternalUrp);
\r
2249 '?', '+', '*': begin
\r
2250 Error (reeQPSBFollowsNothing);
\r
2254 if regparse^ = #0 then begin
\r
2255 Error (reeTrailingBackSlash);
\r
2258 case regparse^ of // r.e.extensions
\r
2259 'd': begin // r.e.extension - any digit ('0' .. '9')
\r
2260 ret := EmitNode (ANYDIGIT);
\r
2261 flagp := flagp or HASWIDTH or SIMPLE;
\r
2263 'D': begin // r.e.extension - not digit ('0' .. '9')
\r
2264 ret := EmitNode (NOTDIGIT);
\r
2265 flagp := flagp or HASWIDTH or SIMPLE;
\r
2267 's': begin // r.e.extension - any space char
\r
2268 {$IFDEF UseSetOfChar}
\r
2269 ret := EmitRange (ANYOF);
\r
2270 EmitRangeStr (SpaceChars);
\r
2273 ret := EmitNode (ANYSPACE);
\r
2275 flagp := flagp or HASWIDTH or SIMPLE;
\r
2277 'S': begin // r.e.extension - not space char
\r
2278 {$IFDEF UseSetOfChar}
\r
2279 ret := EmitRange (ANYBUT);
\r
2280 EmitRangeStr (SpaceChars);
\r
2283 ret := EmitNode (NOTSPACE);
\r
2285 flagp := flagp or HASWIDTH or SIMPLE;
\r
2287 'w': begin // r.e.extension - any english char or '_'
\r
2288 {$IFDEF UseSetOfChar}
\r
2289 ret := EmitRange (ANYOF);
\r
2290 EmitRangeStr (WordChars);
\r
2293 ret := EmitNode (ANYLETTER);
\r
2295 flagp := flagp or HASWIDTH or SIMPLE;
\r
2297 'W': begin // r.e.extension - not english char or '_'
\r
2298 {$IFDEF UseSetOfChar}
\r
2299 ret := EmitRange (ANYBUT);
\r
2300 EmitRangeStr (WordChars);
\r
2303 ret := EmitNode (NOTLETTER);
\r
2305 flagp := flagp or HASWIDTH or SIMPLE;
\r
2307 '1' .. '9': begin //###0.936
\r
2308 if (fCompModifiers and MaskModI) = MaskModI
\r
2309 then ret := EmitNode (BSUBEXPCI)
\r
2310 else ret := EmitNode (BSUBEXP);
\r
2311 EmitC (REChar (ord (regparse^) - ord ('0')));
\r
2312 flagp := flagp or HASWIDTH or SIMPLE;
\r
2314 else EmitExactly (UnQuoteChar (regparse));
\r
2320 len := strcspn (regparse, META);
\r
2322 if regparse^ <> '{' then begin
\r
2323 Error (reeRarseAtomInternalDisaster);
\r
2326 else len := strcspn (regparse + 1, META) + 1; // bad {n,m} - compile as EXATLY
\r
2327 ender := (regparse + len)^;
\r
2329 and ((ender = '*') or (ender = '+') or (ender = '?') or (ender = '{'))
\r
2330 then dec (len); // Back off clear of ?+*{ operand.
\r
2331 flagp := flagp or HASWIDTH;
\r
2333 then flagp := flagp or SIMPLE;
\r
2334 if (fCompModifiers and MaskModI) = MaskModI
\r
2335 then ret := EmitNode (EXACTLYCI)
\r
2336 else ret := EmitNode (EXACTLY);
\r
2337 while len > 0 do begin
\r
2338 EmitC (regparse^);
\r
2343 end; { of case else}
\r
2347 end; { of function TRegExpr.ParseAtom
\r
2348 --------------------------------------------------------------}
\r
2350 function TRegExpr.GetCompilerErrorPos : integer;
\r
2353 if (regexpbeg = nil) or (regparse = nil)
\r
2354 then EXIT; // not in compiling mode ?
\r
2355 Result := regparse - regexpbeg;
\r
2356 end; { of function TRegExpr.GetCompilerErrorPos
\r
2357 --------------------------------------------------------------}
\r
2360 {=============================================================}
\r
2361 {===================== Matching section ======================}
\r
2362 {=============================================================}
\r
2364 {$IFNDEF UseSetOfChar}
\r
2365 function TRegExpr.StrScanCI (s : PRegExprChar; ch : REChar) : PRegExprChar; //###0.928 - now method of TRegExpr
\r
2367 while (s^ <> #0) and (s^ <> ch) and (s^ <> InvertCase (ch))
\r
2371 else Result := nil;
\r
2372 end; { of function TRegExpr.StrScanCI
\r
2373 --------------------------------------------------------------}
\r
2376 function TRegExpr.regrepeat (p : PRegExprChar; AMax : integer) : integer;
\r
2377 // repeatedly match something simple, report how many
\r
2379 scan : PRegExprChar;
\r
2380 opnd : PRegExprChar;
\r
2382 {Ch,} InvCh : REChar; //###0.931
\r
2383 sestart, seend : PRegExprChar; //###0.936
\r
2387 opnd := p + REOpSz + RENextOffSz; //OPERAND
\r
2388 TheMax := fInputEnd - scan;
\r
2390 then TheMax := AMax;
\r
2391 case PREOp (p)^ of
\r
2394 inc (scan, Result);
\r
2396 EXACTLY: begin // in opnd can be only ONE char !!!
\r
2397 // Ch := opnd^; // store in register //###0.931
\r
2398 while (Result < TheMax) and (opnd^ = scan^) do begin
\r
2403 EXACTLYCI: begin // in opnd can be only ONE char !!!
\r
2404 // Ch := opnd^; // store in register //###0.931
\r
2405 while (Result < TheMax) and (opnd^ = scan^) do begin // prevent unneeded InvertCase //###0.931
\r
2409 if Result < TheMax then begin //###0.931
\r
2410 InvCh := InvertCase (opnd^); // store in register
\r
2411 while (Result < TheMax) and
\r
2412 ((opnd^ = scan^) or (InvCh = scan^)) do begin
\r
2418 BSUBEXP: begin //###0.936
\r
2419 sestart := startp [ord (opnd^)];
\r
2422 seend := endp [ord (opnd^)];
\r
2427 while opnd < seend do begin
\r
2428 if (scan >= fInputEnd) or (scan^ <> opnd^)
\r
2435 UNTIL Result >= AMax;
\r
2437 BSUBEXPCI: begin //###0.936
\r
2438 sestart := startp [ord (opnd^)];
\r
2441 seend := endp [ord (opnd^)];
\r
2446 while opnd < seend do begin
\r
2447 if (scan >= fInputEnd) or
\r
2448 ((scan^ <> opnd^) and (scan^ <> InvertCase (opnd^)))
\r
2455 UNTIL Result >= AMax;
\r
2458 while (Result < TheMax) and
\r
2459 (scan^ >= '0') and (scan^ <= '9') do begin
\r
2464 while (Result < TheMax) and
\r
2465 ((scan^ < '0') or (scan^ > '9')) do begin
\r
2469 {$IFNDEF UseSetOfChar} //###0.929
\r
2471 while (Result < TheMax) and
\r
2472 // !!!!!?????? if length (fWordChars) <> 0
\r
2473 // then Pos (scan^, fWordChars)
\r
2475 ((scan^ >= 'a') and (scan^ <= 'z')
\r
2476 or (scan^ >= 'A') and (scan^ <= 'Z') or (scan^ = '_')) do begin
\r
2481 while (Result < TheMax) and
\r
2482 // !!!!!?????? if length (fWordChars) <> 0
\r
2483 // then Pos (scan^, fWordChars)
\r
2485 not ((scan^ >= 'a') and (scan^ <= 'z')
\r
2486 or (scan^ >= 'A') and (scan^ <= 'Z')
\r
2487 or (scan^ = '_')) do begin
\r
2492 while (Result < TheMax) and
\r
2493 (Pos (scan^, fSpaceChars) > 0) do begin
\r
2498 while (Result < TheMax) and
\r
2499 (Pos (scan^, fSpaceChars) <= 0) do begin
\r
2504 ANYOFTINYSET: begin
\r
2505 while (Result < TheMax) and //!!!TinySet
\r
2506 ((scan^ = opnd^) or (scan^ = (opnd + 1)^)
\r
2507 or (scan^ = (opnd + 2)^)) do begin
\r
2512 ANYBUTTINYSET: begin
\r
2513 while (Result < TheMax) and //!!!TinySet
\r
2514 (scan^ <> opnd^) and (scan^ <> (opnd + 1)^)
\r
2515 and (scan^ <> (opnd + 2)^) do begin
\r
2520 {$IFDEF UseSetOfChar} //###0.929
\r
2521 ANYOFFULLSET: begin
\r
2522 while (Result < TheMax) and
\r
2523 (scan^ in PSetOfREChar (opnd)^) do begin
\r
2530 while (Result < TheMax) and
\r
2531 (StrScan (opnd, scan^) <> nil) do begin
\r
2536 while (Result < TheMax) and
\r
2537 (StrScan (opnd, scan^) = nil) do begin
\r
2542 while (Result < TheMax) and (StrScanCI (opnd, scan^) <> nil) do begin
\r
2547 while (Result < TheMax) and (StrScanCI (opnd, scan^) = nil) do begin
\r
2552 else begin // Oh dear. Called inappropriately.
\r
2553 Result := 0; // Best compromise.
\r
2554 Error (reeRegRepeatCalledInappropriately);
\r
2559 end; { of function TRegExpr.regrepeat
\r
2560 --------------------------------------------------------------}
\r
2562 function TRegExpr.regnext (p : PRegExprChar) : PRegExprChar;
\r
2563 // dig the "next" pointer out of a node
\r
2564 var offset : TRENextOff;
\r
2566 if p = @regdummy then begin
\r
2570 offset := PRENextOff (p + REOpSz)^; //###0.933 inlined NEXT
\r
2572 then Result := nil
\r
2573 else Result := p + offset;
\r
2574 end; { of function TRegExpr.regnext
\r
2575 --------------------------------------------------------------}
\r
2577 function TRegExpr.MatchPrim (prog : PRegExprChar) : boolean;
\r
2578 // recursively matching routine
\r
2579 // Conceptually the strategy is simple: check to see whether the current
\r
2580 // node matches, call self recursively to see whether the rest matches,
\r
2581 // and then act accordingly. In practice we make some effort to avoid
\r
2582 // recursion, in particular by going through "ordinary" nodes (that don't
\r
2583 // need to know whether the rest of the match failed) by a loop instead of
\r
2586 scan : PRegExprChar; // Current node.
\r
2587 next : PRegExprChar; // Next node.
\r
2589 opnd : PRegExprChar;
\r
2591 save : PRegExprChar;
\r
2593 BracesMin, BracesMax : integer; // we use integer instead of TREBracesArg for better support */+
\r
2594 {$IFDEF ComplexBraces}
\r
2595 SavedLoopStack : array [1 .. LoopStackMax] of integer; // :(( very bad for recursion
\r
2596 SavedLoopStackIdx : integer; //###0.925
\r
2602 while scan <> nil do begin
\r
2603 len := PRENextOff (scan + 1)^; //###0.932 inlined regnext
\r
2606 else next := scan + len;
\r
2609 BOL: if reginput <> fInputStart
\r
2611 EOL: if reginput^ <> #0
\r
2619 if (reginput^ = #0) or (reginput^ < '0') or (reginput^ > '9')
\r
2624 if (reginput^ = #0) or ((reginput^ >= '0') and (reginput^ <= '9'))
\r
2628 {$IFNDEF UseSetOfChar} //###0.929
\r
2630 if (reginput^ = #0) or
\r
2631 // !!!!!?????? Pos (scan^, fWordChars)
\r
2632 not ((reginput^ >= 'a') and (reginput^ <= 'z')
\r
2633 or (reginput^ >= 'A') and (reginput^ <= 'Z')
\r
2634 or (reginput^ = '_'))
\r
2639 if (reginput^ = #0) or
\r
2640 // !!!!!?????? Pos (scan^, fWordChars)
\r
2641 (reginput^ >= 'a') and (reginput^ <= 'z')
\r
2642 or (reginput^ >= 'A') and (reginput^ <= 'Z')
\r
2643 or (reginput^ = '_')
\r
2648 if (reginput^ = #0) or not (Pos (scan^, fSpaceChars) > 0)
\r
2653 if (reginput^ = #0) or (Pos (scan^, fSpaceChars) > 0)
\r
2659 opnd := scan + REOpSz + RENextOffSz; // OPERAND
\r
2660 // Inline the first character, for speed.
\r
2661 if (opnd^ <> reginput^)
\r
2662 and (InvertCase (opnd^) <> reginput^)
\r
2664 len := strlen (opnd);
\r
2668 while no > 1 do begin
\r
2671 if (opnd^ <> save^)
\r
2672 and (InvertCase (opnd^) <> save^)
\r
2677 inc (reginput, len);
\r
2680 opnd := scan + REOpSz + RENextOffSz; // OPERAND
\r
2681 // Inline the first character, for speed.
\r
2682 if opnd^ <> reginput^
\r
2684 len := strlen (opnd);
\r
2688 while no > 1 do begin
\r
2696 inc (reginput, len);
\r
2698 BSUBEXP: begin //###0.936
\r
2699 no := ord ((scan + REOpSz + RENextOffSz)^);
\r
2700 if startp [no] = nil
\r
2702 if endp [no] = nil
\r
2705 opnd := startp [no];
\r
2706 while opnd < endp [no] do begin
\r
2707 if (save >= fInputEnd) or (save^ <> opnd^)
\r
2714 BSUBEXPCI: begin //###0.936
\r
2715 no := ord ((scan + REOpSz + RENextOffSz)^);
\r
2716 if startp [no] = nil
\r
2718 if endp [no] = nil
\r
2721 opnd := startp [no];
\r
2722 while opnd < endp [no] do begin
\r
2723 if (save >= fInputEnd) or
\r
2724 ((save^ <> opnd^) and (save^ <> InvertCase (opnd^)))
\r
2731 ANYOFTINYSET: begin
\r
2732 if (reginput^ = #0) or //!!!TinySet
\r
2733 ((reginput^ <> (scan + REOpSz + RENextOffSz)^)
\r
2734 and (reginput^ <> (scan + REOpSz + RENextOffSz + 1)^)
\r
2735 and (reginput^ <> (scan + REOpSz + RENextOffSz + 2)^))
\r
2739 ANYBUTTINYSET: begin
\r
2740 if (reginput^ = #0) or //!!!TinySet
\r
2741 (reginput^ = (scan + REOpSz + RENextOffSz)^)
\r
2742 or (reginput^ = (scan + REOpSz + RENextOffSz + 1)^)
\r
2743 or (reginput^ = (scan + REOpSz + RENextOffSz + 2)^)
\r
2747 {$IFDEF UseSetOfChar} //###0.929
\r
2748 ANYOFFULLSET: begin
\r
2749 if (reginput^ = #0)
\r
2750 or not (reginput^ in PSetOfREChar (scan + REOpSz + RENextOffSz)^)
\r
2756 if (reginput^ = #0) or (StrScan (scan + REOpSz + RENextOffSz, reginput^) = nil)
\r
2761 if (reginput^ = #0) or (StrScan (scan + REOpSz + RENextOffSz, reginput^) <> nil)
\r
2766 if (reginput^ = #0) or (StrScanCI (scan + REOpSz + RENextOffSz, reginput^) = nil)
\r
2771 if (reginput^ = #0) or (StrScanCI (scan + REOpSz + RENextOffSz, reginput^) <> nil)
\r
2779 Succ (OPEN) .. TREOp (Ord (OPEN) + NSUBEXP - 1) : begin //###0.929
\r
2780 no := ord (scan^) - ord (OPEN);
\r
2781 // save := reginput;
\r
2782 save := startp [no]; //###0.936
\r
2783 startp [no] := reginput; //###0.936
\r
2784 Result := MatchPrim (next);
\r
2785 if not Result //###0.936
\r
2786 then startp [no] := save;
\r
2787 // if Result and (startp [no] = nil)
\r
2788 // then startp [no] := save;
\r
2789 // Don't set startp if some later invocation of the same
\r
2790 // parentheses already has.
\r
2793 Succ (CLOSE) .. TREOp (Ord (CLOSE) + NSUBEXP - 1): begin //###0.929
\r
2794 no := ord (scan^) - ord (CLOSE);
\r
2795 // save := reginput;
\r
2796 save := endp [no]; //###0.936
\r
2797 endp [no] := reginput; //###0.936
\r
2798 Result := MatchPrim (next);
\r
2799 if not Result //###0.936
\r
2800 then endp [no] := save;
\r
2801 // if Result and (endp [no] = nil)
\r
2802 // then endp [no] := save;
\r
2803 // Don't set endp if some later invocation of the same
\r
2804 // parentheses already has.
\r
2808 if (next^ <> BRANCH) // No choice.
\r
2809 then next := scan + REOpSz + RENextOffSz // Avoid recursion
\r
2813 Result := MatchPrim (scan + REOpSz + RENextOffSz);
\r
2817 scan := regnext (scan);
\r
2818 UNTIL (scan = nil) or (scan^ <> BRANCH);
\r
2822 {$IFDEF ComplexBraces}
\r
2823 LOOPENTRY: begin //###0.925
\r
2824 no := LoopStackIdx;
\r
2825 inc (LoopStackIdx);
\r
2826 if LoopStackIdx > LoopStackMax then begin
\r
2827 Error (reeLoopStackExceeded);
\r
2831 LoopStack [LoopStackIdx] := 0; // init loop counter
\r
2832 Result := MatchPrim (next); // execute LOOP
\r
2833 LoopStackIdx := no; // cleanup
\r
2839 LOOP: begin //###0.925
\r
2840 if LoopStackIdx <= 0 then begin
\r
2841 Error (reeLoopWithoutEntry);
\r
2844 opnd := scan + PRENextOff (scan + REOpSz + RENextOffSz + 2 * REBracesArgSz)^;
\r
2845 BracesMin := PREBracesArg (scan + REOpSz + RENextOffSz)^;
\r
2846 BracesMax := PREBracesArg (scan + REOpSz + RENextOffSz + REBracesArgSz)^;
\r
2848 if LoopStack [LoopStackIdx] >= BracesMin then begin
\r
2850 if LoopStack [LoopStackIdx] < BracesMax then begin
\r
2851 inc (LoopStack [LoopStackIdx]);
\r
2852 no := LoopStackIdx;
\r
2853 Result := MatchPrim (opnd);
\r
2854 LoopStackIdx := no;
\r
2859 dec (LoopStackIdx);
\r
2860 Result := MatchPrim (next);
\r
2862 then reginput := save;
\r
2865 else begin // first match a min_cnt times
\r
2866 inc (LoopStack [LoopStackIdx]);
\r
2867 no := LoopStackIdx;
\r
2868 Result := MatchPrim (opnd);
\r
2869 LoopStackIdx := no;
\r
2872 dec (LoopStack [LoopStackIdx]);
\r
2878 STAR, PLUS, BRACES: begin
\r
2879 // Lookahead to avoid useless match attempts when we know
\r
2880 // what character comes next.
\r
2882 if next^ = EXACTLY
\r
2883 then nextch := (next + REOpSz + RENextOffSz)^;
\r
2884 BracesMax := MaxInt; // infinite loop for * and + //###0.92
\r
2886 then BracesMin := 0 // STAR
\r
2887 else if scan^ = PLUS
\r
2888 then BracesMin := 1 // PLUS
\r
2889 else begin // BRACES
\r
2890 BracesMin := PREBracesArg (scan + REOpSz + RENextOffSz)^;
\r
2891 BracesMax := PREBracesArg (scan + REOpSz + RENextOffSz + REBracesArgSz)^;
\r
2894 opnd := scan + REOpSz + RENextOffSz;
\r
2896 then inc (opnd, 2 * REBracesArgSz);
\r
2897 no := regrepeat (opnd, BracesMax); // don't repeat more than max_cnt
\r
2898 while no >= BracesMin do begin
\r
2899 // If it could work, try it.
\r
2900 if (nextch = #0) or (reginput^ = nextch) then begin
\r
2901 {$IFDEF ComplexBraces}
\r
2902 System.Move (LoopStack, SavedLoopStack, SizeOf (LoopStack)); //###0.925
\r
2903 SavedLoopStackIdx := LoopStackIdx;
\r
2905 if MatchPrim (next) then begin
\r
2909 {$IFDEF ComplexBraces}
\r
2910 System.Move (SavedLoopStack, LoopStack, SizeOf (LoopStack));
\r
2911 LoopStackIdx := SavedLoopStackIdx;
\r
2914 dec (no); // Couldn't or didn't - back up.
\r
2915 reginput := save + no;
\r
2920 Result := true; // Success!
\r
2924 Error (reeMatchPrimMemoryCorruption);
\r
2927 end; { of case scan^}
\r
2929 end; { of while scan <> nil}
\r
2931 // We get here only if there's trouble -- normally "case EEND" is the
\r
2932 // terminating point.
\r
2933 Error (reeMatchPrimCorruptedPointers);
\r
2934 end; { of function TRegExpr.MatchPrim
\r
2935 --------------------------------------------------------------}
\r
2937 {$IFDEF UseFirstCharSet} //###0.929
\r
2938 procedure TRegExpr.FillFirstCharSet (prog : PRegExprChar);
\r
2940 scan : PRegExprChar; // Current node.
\r
2941 next : PRegExprChar; // Next node.
\r
2942 opnd : PRegExprChar;
\r
2943 min_cnt : integer;
\r
2946 while scan <> nil do begin
\r
2947 next := regnext (scan);
\r
2948 case PREOp (scan)^ of
\r
2949 BSUBEXP, BSUBEXPCI: begin //###0.938
\r
2950 FirstCharSet := [#0 .. #255]; // :((( we cannot
\r
2951 // optimize r.e. if it starts with back reference
\r
2954 BOL: ; // EXIT; //###0.937
\r
2955 EOL: ; // EXIT; //###0.937
\r
2957 FirstCharSet := [#0 .. #255]; //###0.930
\r
2961 FirstCharSet := FirstCharSet + ['0' .. '9'];
\r
2965 FirstCharSet := [#0 .. #255] - ['0' .. '9'];
\r
2969 Include (FirstCharSet, (scan + REOpSz + RENextOffSz)^);
\r
2970 Include (FirstCharSet, InvertCase ((scan + REOpSz + RENextOffSz)^));
\r
2974 Include (FirstCharSet, (scan + REOpSz + RENextOffSz)^);
\r
2977 ANYOFFULLSET: begin
\r
2978 FirstCharSet := FirstCharSet + PSetOfREChar (scan + REOpSz + RENextOffSz)^;
\r
2981 ANYOFTINYSET: begin
\r
2983 Include (FirstCharSet, (scan + REOpSz + RENextOffSz)^);
\r
2984 Include (FirstCharSet, (scan + REOpSz + RENextOffSz + 1)^);
\r
2985 Include (FirstCharSet, (scan + REOpSz + RENextOffSz + 2)^);
\r
2986 // ... // up to TinySetLen
\r
2989 ANYBUTTINYSET: begin
\r
2991 FirstCharSet := [#0 .. #255];
\r
2992 Exclude (FirstCharSet, (scan + REOpSz + RENextOffSz)^);
\r
2993 Exclude (FirstCharSet, (scan + REOpSz + RENextOffSz + 1)^);
\r
2994 Exclude (FirstCharSet, (scan + REOpSz + RENextOffSz + 2)^);
\r
2995 // ... // up to TinySetLen
\r
3001 Succ (OPEN) .. TREOp (Ord (OPEN) + NSUBEXP - 1) : begin //###0.929
\r
3002 FillFirstCharSet (next);
\r
3005 Succ (CLOSE) .. TREOp (Ord (CLOSE) + NSUBEXP - 1): begin //###0.929
\r
3006 FillFirstCharSet (next);
\r
3010 if (PREOp (next)^ <> BRANCH) // No choice.
\r
3011 then next := scan + REOpSz + RENextOffSz // Avoid recursion.
\r
3014 FillFirstCharSet (scan + REOpSz + RENextOffSz);
\r
3015 scan := regnext (scan);
\r
3016 UNTIL (scan = nil) or (PREOp (scan)^ <> BRANCH);
\r
3020 {$IFDEF ComplexBraces}
\r
3021 LOOPENTRY: begin //###0.925
\r
3022 LoopStack [LoopStackIdx] := 0; // init loop counter
\r
3023 FillFirstCharSet (next); // execute LOOP
\r
3026 LOOP: begin //###0.925
\r
3027 opnd := scan + PRENextOff (scan + REOpSz + RENextOffSz + REBracesArgSz * 2)^;
\r
3028 min_cnt := PREBracesArg (scan + REOpSz + RENextOffSz)^;
\r
3029 FillFirstCharSet (opnd);
\r
3031 then FillFirstCharSet (next);
\r
3036 FillFirstCharSet (scan + REOpSz + RENextOffSz);
\r
3038 FillFirstCharSet (scan + REOpSz + RENextOffSz);
\r
3042 opnd := scan + REOpSz + RENextOffSz + REBracesArgSz * 2;
\r
3043 min_cnt := PREBracesArg (scan + REOpSz + RENextOffSz)^; // BRACES
\r
3044 FillFirstCharSet (opnd);
\r
3052 Error (reeMatchPrimMemoryCorruption);
\r
3055 end; { of case scan^}
\r
3057 end; { of while scan <> nil}
\r
3058 end; { of procedure FillFirstCharSet;
\r
3059 --------------------------------------------------------------}
\r
3062 function TRegExpr.RegMatch (str : PRegExprChar) : boolean;
\r
3063 // try match at specific point
\r
3066 for i := 0 to NSUBEXP - 1 do begin
\r
3067 startp [i] := nil;
\r
3071 Result := MatchPrim (programm + REOpSz);
\r
3072 if Result then begin
\r
3073 startp [0] := str;
\r
3074 endp [0] := reginput;
\r
3076 end; { of function TRegExpr.RegMatch
\r
3077 --------------------------------------------------------------}
\r
3079 function TRegExpr.Exec (const AInputString : RegExprString) : boolean;
\r
3081 InputString := AInputString;
\r
3082 Result := ExecPrim (1);
\r
3083 end; { of function TRegExpr.Exec
\r
3084 --------------------------------------------------------------}
\r
3086 function TRegExpr.ExecPrim (AOffset: integer) : boolean;
\r
3089 StartPtr: PRegExprChar;
\r
3090 InputLen : integer;
\r
3092 Result := false; // Be paranoid...
\r
3094 if not IsProgrammOk //###0.929
\r
3097 // Check InputString presence
\r
3098 if not Assigned (fInputString) then begin
\r
3099 Error (reeNoInpitStringSpecified);
\r
3103 InputLen := length (fInputString);
\r
3105 //Check that the start position is not negative
\r
3106 if AOffset < 1 then begin
\r
3107 Error (reeOffsetMustBeGreaterThen0);
\r
3110 // Check that the start position is not longer than the line
\r
3111 // If so then exit with nothing found
\r
3112 if AOffset > (InputLen + 1) // for matching empty string after last char.
\r
3115 StartPtr := fInputString + AOffset - 1;
\r
3117 // If there is a "must appear" string, look for it.
\r
3118 if regmust <> nil then begin
\r
3121 s := StrScan (s, regmust [0]);
\r
3122 if s <> nil then begin
\r
3123 if StrLComp (s, regmust, regmlen) = 0
\r
3124 then BREAK; // Found it.
\r
3128 if s = nil // Not present.
\r
3132 // Mark beginning of line for ^ .
\r
3133 fInputStart := fInputString;
\r
3135 // Pointer to end of input stream - for
\r
3136 // pascal-style string processing (may include #0)
\r
3137 fInputEnd := fInputString + InputLen;
\r
3139 {$IFDEF ComplexBraces}
\r
3140 // no loops started
\r
3141 LoopStackIdx := 0; //###0.925
\r
3144 // Simplest case: anchored match need be tried only once.
\r
3145 if reganch <> #0 then begin
\r
3146 Result := RegMatch (StartPtr);
\r
3150 // Messy cases: unanchored match.
\r
3152 if regstart <> #0 then // We know what char it must start with.
\r
3154 s := StrScan (s, regstart);
\r
3155 if s <> nil then begin
\r
3156 Result := RegMatch (s);
\r
3162 else begin // We don't - general case.
\r
3163 {$IFDEF UseFirstCharSet} //###0.929
\r
3164 while s^ <> #0 do begin
\r
3165 if s^ in FirstCharSet
\r
3166 then Result := RegMatch (s);
\r
3173 Result := RegMatch (s);
\r
3181 end; { of function TRegExpr.ExecPrim
\r
3182 --------------------------------------------------------------}
\r
3184 function TRegExpr.ExecNext : boolean;
\r
3185 var offset : integer;
\r
3188 if not Assigned (startp[0]) or not Assigned (endp[0]) then begin
\r
3189 Error (reeExecNextWithoutExec);
\r
3192 // Offset := MatchPos [0] + MatchLen [0];
\r
3193 // if MatchLen [0] = 0
\r
3194 Offset := endp [0] - fInputString + 1; //###0.929
\r
3195 if endp [0] = startp [0] //###0.929
\r
3196 then inc (Offset); // prevent infinite looping if empty string match r.e.
\r
3197 Result := ExecPrim (Offset);
\r
3198 end; { of function TRegExpr.ExecNext
\r
3199 --------------------------------------------------------------}
\r
3201 function TRegExpr.ExecPos (AOffset: integer {$IFDEF D4_}= 1{$ENDIF}) : boolean;
\r
3203 Result := ExecPrim (AOffset);
\r
3204 end; { of function TRegExpr.ExecPos
\r
3205 --------------------------------------------------------------}
\r
3207 function TRegExpr.GetInputString : RegExprString;
\r
3209 if not Assigned (fInputString) then begin
\r
3210 Error (reeGetInputStringWithoutInputString);
\r
3213 Result := fInputString;
\r
3214 end; { of function TRegExpr.GetInputString
\r
3215 --------------------------------------------------------------}
\r
3217 procedure TRegExpr.SetInputString (const AInputString : RegExprString);
\r
3222 // clear Match* - before next Exec* call it's undefined
\r
3223 for i := 0 to NSUBEXP - 1 do begin
\r
3224 startp [i] := nil;
\r
3228 // need reallocation of input string buffer ?
\r
3229 Len := length (AInputString);
\r
3230 if Assigned (fInputString) and (Length (fInputString) <> Len) then begin
\r
3231 FreeMem (fInputString);
\r
3232 fInputString := nil;
\r
3234 // buffer [re]allocation
\r
3235 if not Assigned (fInputString)
\r
3236 then GetMem (fInputString, (Len + 1) * SizeOf (REChar));
\r
3238 // copy input string into buffer
\r
3240 StrPCopy (fInputString, Copy (AInputString, 1, Len)); //###0.927
\r
3242 StrLCopy (fInputString, PRegExprChar (AInputString), Len);
\r
3246 fInputString : string;
\r
3247 fInputStart, fInputEnd : PRegExprChar;
\r
3250 fInputString := AInputString;
\r
3251 UniqueString (fInputString);
\r
3252 fInputStart := PChar (fInputString);
\r
3253 Len := length (fInputString);
\r
3254 fInputEnd := PRegExprChar (integer (fInputStart) + Len); ??
\r
3255 !! startp/endp âñå ðàâíî áóäåò îïàñíî èñïîëüçîâàòü ?
\r
3257 end; { of procedure TRegExpr.SetInputString
\r
3258 --------------------------------------------------------------}
\r
3260 function TRegExpr.Substitute (const ATemplate : RegExprString) : RegExprString;
\r
3261 // perform substitutions after a regexp match
\r
3262 // completely rewritten in 0.929
\r
3264 TemplateLen : integer;
\r
3265 TemplateBeg, TemplateEnd : PRegExprChar;
\r
3266 p, p0, ResultPtr : PRegExprChar;
\r
3267 ResultLen : integer;
\r
3270 function ParseVarName (var APtr : PRegExprChar) : integer;
\r
3271 // extract name of variable (digits, may be enclosed with
\r
3272 // curly braces) from APtr^, uses TemplateEnd !!!
\r
3274 Digits = ['0' .. '9'];
\r
3277 Delimited : boolean;
\r
3281 Delimited := (p < TemplateEnd) and (p^ = '{');
\r
3283 then inc (p); // skip left curly brace
\r
3284 if (p < TemplateEnd) and (p^ = '&')
\r
3285 then inc (p) // this is '$&' or '${&}'
\r
3287 while (p < TemplateEnd) and
\r
3288 {$IFDEF UniCode} //###0.935
\r
3289 (ord (p^) < 256) and (char (p^) in Digits)
\r
3294 inc (Result, ord (p^) - ord ('0'));
\r
3298 if (p < TemplateEnd) and (p^ = '}')
\r
3299 then inc (p) // skip right curly brace
\r
3300 else p := APtr; // isn't properly terminated
\r
3302 then Result := -1; // no valid digits found or no right curly brace
\r
3306 // Check programm and input string
\r
3307 if not IsProgrammOk
\r
3309 if not Assigned (fInputString) then begin
\r
3310 Error (reeNoInpitStringSpecified);
\r
3313 // Prepare for working
\r
3314 TemplateLen := length (ATemplate);
\r
3315 if TemplateLen = 0 then begin // prevent nil pointers
\r
3319 TemplateBeg := pointer (ATemplate);
\r
3320 TemplateEnd := TemplateBeg + TemplateLen;
\r
3321 // Count result length for speed optimization.
\r
3324 while p < TemplateEnd do begin
\r
3328 then n := ParseVarName (p)
\r
3330 if n >= 0 then begin
\r
3331 if (n < NSUBEXP) and Assigned (startp [n]) and Assigned (endp [n])
\r
3332 then inc (ResultLen, endp [n] - startp [n]);
\r
3335 if (Ch = '\') and (p < TemplateEnd)
\r
3336 then inc (p); // quoted or special char followed
\r
3340 // Get memory. We do it once and it significant speed up work !
\r
3341 if ResultLen = 0 then begin
\r
3345 SetString (Result, nil, ResultLen);
\r
3347 ResultPtr := pointer (Result);
\r
3349 while p < TemplateEnd do begin
\r
3353 then n := ParseVarName (p)
\r
3355 if n >= 0 then begin
\r
3357 if (n < NSUBEXP) and Assigned (p0) and Assigned (endp [n]) then
\r
3358 while p0 < endp [n] do begin
\r
3359 ResultPtr^ := p0^;
\r
3365 if (Ch = '\') and (p < TemplateEnd) then begin // quoted or special char followed
\r
3373 end; { of function TRegExpr.Substitute
\r
3374 --------------------------------------------------------------}
\r
3377 function TRegExpr.Substitute (const ATemplate : RegExprString) : RegExprString;
\r
3378 // perform substitutions after a regexp match
\r
3380 src : integer; // PRegExprChar; //###0.927
\r
3386 if not IsProgrammOk //###0.929
\r
3389 src := 1; // PRegExprChar (ATemplate); //###0.927
\r
3390 while src <= Length (ATemplate) { ^ <> #0} do begin //###0.927
\r
3391 c := ATemplate [src]; // src^; //###0.927
\r
3393 c2 := ATemplate [src]; //###0.927
\r
3396 else if (c = '\') and ('0' <= c2) and (c2 <= '9')
\r
3398 no := ord (c2) - ord ('0');
\r
3403 if no < 0 then begin // Ordinary character.
\r
3404 if (c = '\') and ((c2 = '\') or (c2 = '&')) then begin
\r
3408 Result := Result + c;
\r
3410 else Result := Result + Match [no]; //###0.90
\r
3412 end; { of function TRegExpr.Substitute
\r
3413 --------------------------------------------------------------}
\r
3416 procedure TRegExpr.Split (AInputStr : RegExprString; APieces : TStrings);
\r
3417 var PrevPos : integer;
\r
3420 if Exec (AInputStr) then
\r
3422 APieces.Add (System.Copy (AInputStr, PrevPos, MatchPos [0] - PrevPos));
\r
3423 PrevPos := MatchPos [0] + MatchLen [0];
\r
3424 UNTIL not ExecNext;
\r
3425 APieces.Add (System.Copy (AInputStr, PrevPos, MaxInt)); // Tail
\r
3426 end; { of procedure TRegExpr.Split
\r
3427 --------------------------------------------------------------}
\r
3429 function TRegExpr.Replace (AInputStr : RegExprString; const AReplaceStr : RegExprString) : RegExprString;
\r
3430 var PrevPos : integer;
\r
3434 if Exec (AInputStr) then
\r
3436 Result := Result + System.Copy (AInputStr, PrevPos,
\r
3437 MatchPos [0] - PrevPos) + AReplaceStr;
\r
3438 PrevPos := MatchPos [0] + MatchLen [0];
\r
3439 UNTIL not ExecNext;
\r
3440 Result := Result + System.Copy (AInputStr, PrevPos, MaxInt); // Tail
\r
3441 end; { of function TRegExpr.Replace
\r
3442 --------------------------------------------------------------}
\r
3445 {=============================================================}
\r
3446 {====================== Debug section ========================}
\r
3447 {=============================================================}
\r
3449 {$IFDEF DebugRegExpr}
\r
3450 function TRegExpr.DumpOp (op : TREOp) : RegExprString;
\r
3451 // printable representation of opcode
\r
3454 BOL: Result := 'BOL';
\r
3455 EOL: Result := 'EOL';
\r
3456 ANY: Result := 'ANY';
\r
3457 ANYLETTER: Result := 'ANYLETTER';
\r
3458 NOTLETTER: Result := 'NOTLETTER';
\r
3459 ANYDIGIT: Result := 'ANYDIGIT';
\r
3460 NOTDIGIT: Result := 'NOTDIGIT';
\r
3461 ANYSPACE: Result := 'ANYSPACE';
\r
3462 NOTSPACE: Result := 'NOTSPACE';
\r
3463 ANYOF: Result := 'ANYOF';
\r
3464 ANYBUT: Result := 'ANYBUT';
\r
3465 ANYOFCI: Result := 'ANYOF/CI';
\r
3466 ANYBUTCI: Result := 'ANYBUT/CI';
\r
3467 BRANCH: Result := 'BRANCH';
\r
3468 EXACTLY: Result := 'EXACTLY';
\r
3469 EXACTLYCI: Result := 'EXACTLY/CI';
\r
3470 NOTHING: Result := 'NOTHING';
\r
3471 COMMENT: Result := 'COMMENT';
\r
3472 BACK: Result := 'BACK';
\r
3473 EEND: Result := 'END';
\r
3474 BSUBEXP: Result := 'BSUBEXP';
\r
3475 BSUBEXPCI: Result := 'BSUBEXP/CI';
\r
3476 Succ (OPEN) .. TREOp (Ord (OPEN) + NSUBEXP - 1): //###0.929
\r
3477 Result := Format ('OPEN[%d]', [ord (op) - ord (OPEN)]);
\r
3478 Succ (CLOSE) .. TREOp (Ord (CLOSE) + NSUBEXP - 1): //###0.929
\r
3479 Result := Format ('CLOSE[%d]', [ord (op) - ord (CLOSE)]);
\r
3480 STAR: Result := 'STAR';
\r
3481 PLUS: Result := 'PLUS';
\r
3482 BRACES: Result := 'BRACES';
\r
3483 {$IFDEF ComplexBraces}
\r
3484 LOOPENTRY: Result := 'LOOPENTRY'; //###0.925
\r
3485 LOOP: Result := 'LOOP'; //###0.925
\r
3487 ANYOFTINYSET: Result:= 'ANYOFTINYSET';
\r
3488 ANYBUTTINYSET:Result:= 'ANYBUTTINYSET';
\r
3489 {$IFDEF UseSetOfChar} //###0.929
\r
3490 ANYOFFULLSET: Result:= 'ANYOFFULLSET';
\r
3492 else Error (reeDumpCorruptedOpcode);
\r
3494 Result := ':' + Result;
\r
3495 end; { of function TRegExpr.DumpOp
\r
3496 --------------------------------------------------------------}
\r
3498 function TRegExpr.Dump : RegExprString;
\r
3499 // dump a regexp in vaguely comprehensible form
\r
3502 op : TREOp; // Arbitrary non-END op.
\r
3503 next : PRegExprChar;
\r
3505 {$IFDEF UseSetOfChar} //###0.929
\r
3509 if not IsProgrammOk //###0.929
\r
3514 s := programm + REOpSz;
\r
3515 while op <> EEND do begin // While that wasn't END last time...
\r
3517 Result := Result + Format ('%2d%s', [s - programm, DumpOp (s^)]); // Where, what.
\r
3518 next := regnext (s);
\r
3519 if next = nil // Next ptr.
\r
3520 then Result := Result + ' (0)'
\r
3521 else Result := Result + Format (' (%d) ', [(s - programm) + (next - s)]);
\r
3522 inc (s, REOpSz + RENextOffSz);
\r
3523 if (op = ANYOF) or (op = ANYOFCI) or (op = ANYBUT) or (op = ANYBUTCI)
\r
3524 or (op = EXACTLY) or (op = EXACTLYCI) then begin
\r
3525 // Literal string, where present.
\r
3526 while s^ <> #0 do begin
\r
3527 Result := Result + s^;
\r
3532 if (op = ANYOFTINYSET) or (op = ANYBUTTINYSET) then begin
\r
3533 for i := 1 to TinySetLen do begin
\r
3534 Result := Result + s^;
\r
3538 if (op = BSUBEXP) or (op = BSUBEXPCI) then begin
\r
3539 Result := Result + ' \' + IntToStr (Ord (s^));
\r
3542 {$IFDEF UseSetOfChar} //###0.929
\r
3543 if op = ANYOFFULLSET then begin
\r
3544 for Ch := #0 to #255 do
\r
3545 if Ch in PSetOfREChar (s)^ then
\r
3547 then Result := Result + '#' + IntToStr (Ord (Ch)) //###0.936
\r
3548 else Result := Result + Ch;
\r
3549 inc (s, SizeOf (TSetOfREChar));
\r
3552 if (op = BRACES) then begin
\r
3553 // show min/max argument of BRACES operator
\r
3554 Result := Result + Format ('{%d,%d}', [PREBracesArg (s)^, PREBracesArg (s + REBracesArgSz)^]);
\r
3555 inc (s, REBracesArgSz * 2);
\r
3557 {$IFDEF ComplexBraces}
\r
3558 if op = LOOP then begin //###0.925
\r
3559 Result := Result + Format (' -> (%d) {%d,%d}', [
\r
3560 (s - programm - (REOpSz + RENextOffSz)) + PRENextOff (s + 2 * REBracesArgSz)^,
\r
3561 PREBracesArg (s)^, PREBracesArg (s + REBracesArgSz)^]);
\r
3562 inc (s, 2 * REBracesArgSz + RENextOffSz);
\r
3565 Result := Result + #$d#$a;
\r
3568 // Header fields of interest.
\r
3571 then Result := Result + 'start ' + regstart;
\r
3573 then Result := Result + 'anchored ';
\r
3575 then Result := Result + 'must have ' + regmust;
\r
3576 {$IFDEF UseFirstCharSet} //###0.929
\r
3577 Result := Result + #$d#$a'FirstCharSet:';
\r
3578 for Ch := #0 to #255 do
\r
3579 if Ch in FirstCharSet
\r
3580 then Result := Result + Ch;
\r
3582 Result := Result + #$d#$a;
\r
3583 end; { of function TRegExpr.Dump
\r
3584 --------------------------------------------------------------}
\r
3587 {$IFDEF reRealExceptionAddr}
\r
3588 {$OPTIMIZATION ON}
\r
3589 // ReturnAddr works correctly only if compiler optimization is ON
\r
3590 // I placed this method at very end of unit because there are no
\r
3591 // way to restore compiler optimization flag ...
\r
3593 procedure TRegExpr.Error (AErrorID : integer);
\r
3594 {$IFDEF reRealExceptionAddr}
\r
3595 function ReturnAddr : pointer; //###0.938
\r
3603 fLastError := AErrorID; // dummy stub - useless because will raise exception
\r
3604 if AErrorID < 1000 // compilation error ?
\r
3605 then e := ERegExpr.Create (ErrorMsg (AErrorID) // yes - show error pos
\r
3606 + ' (pos ' + IntToStr (CompilerErrorPos) + ')')
\r
3607 else e := ERegExpr.Create (ErrorMsg (AErrorID));
\r
3608 e.ErrorCode := AErrorID;
\r
3609 e.CompilerErrorPos := CompilerErrorPos;
\r
3611 {$IFDEF reRealExceptionAddr}
\r
3612 At ReturnAddr; //###0.938
\r
3614 end; { of procedure TRegExpr.Error
\r
3615 --------------------------------------------------------------}
\r
3617 // be carefull - placed here code will be always compiled with
\r
3618 // compiler optimization flag
\r