Commit d2ef9c4

Update to version 2018.08.29
1 parent 7ceefe5 commit d2ef9c4

18 files changed: +6311 -3631 lines changed

PyPI/PKG-INFO

+10 -140
Large diffs are not rendered by default.

PyPI/setup.py

+1 -1
@@ -16,7 +16,7 @@

 setup(
 name='regex',
-version='2018.06.06',
+version='2018.08.29',
 description='Alternative regular expression module, to replace re.',
 long_description=open(os.path.join(DOCS_DIR, 'Features.rst')).read(),

README.rst

+9 -139
@@ -124,11 +124,6 @@ Multithreading

 The regex module releases the GIL during matching on instances of the built-in (immutable) string classes, enabling other Python threads to run concurrently. It is also possible to force the regex module to release the GIL during matching by calling the matching methods with the keyword argument ``concurrent=True``. The behaviour is undefined if the string changes during matching, so use it *only* when it is guaranteed that that won't happen.

-Building for 64-bits
---------------------
-
-If the source files are built for a 64-bit target then the string positions will also be 64-bit.
-
 Unicode
 -------

@@ -141,8 +136,6 @@ Additional features

 The issue numbers relate to the Python bug tracker, except where listed as "Hg issue".

-* Fixed support for pickling compiled regexes (`Hg issue 195 <https://bitbucket.org/mrabarnett/mrab-regex/issues/195>`_)
-
 * Added support for lookaround in conditional pattern (`Hg issue 163 <https://bitbucket.org/mrabarnett/mrab-regex/issues/163>`_)

 The test of a conditional pattern can now be a lookaround.
@@ -339,20 +332,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i
 >>> regex.sub('(?V1).*?', '|', 'test')
 '|||||||||'

-* re.group() should never return a bytearray (`issue #18468 <https://bugs.python.org/issue18468>`_)
-
-For compatibility with the re module, the regex module returns all matching bytestrings as ``bytes``, starting from Python 3.4.
-
-Examples:
-
-.. sourcecode:: python
-
->>> regex.match(b'.', bytearray(b'a')).group()
-# Python 3.4 and later
-b'a'
-# Python 3.3 and earlier
-bytearray(b'a')
-
 * Added ``capturesdict`` (`Hg issue 86 <https://bitbucket.org/mrabarnett/mrab-regex/issues/86>`_)

 ``capturesdict`` is a combination of ``groupdict`` and ``captures``:
@@ -480,7 +459,7 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 * Detach searched string

-A match object contains a reference to the string that was searched, via its ``string`` attribute. The match object now has a ``detach_string`` method that will 'detach' that string, making it available for garbage collection (this might save valuable memory if that string is very large).
+A match object contains a reference to the string that was searched, via its ``string`` attribute. The ``detach_string`` method will 'detach' that string, making it available for garbage collection, which might save valuable memory if that string is very large.

 Example:

@@ -497,10 +476,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i
 >>> print(m.string)
 None

-* Characters in a group name (`issue #14462 <https://bugs.python.org/issue14462>`_)
-
-A group name can now contain the same characters as an identifier. These are different in Python 2 and Python 3.
-
 * Recursive patterns (`Hg issue 27 <https://bitbucket.org/mrabarnett/mrab-regex/issues/27>`_)

 Recursive and repeated patterns are supported.
@@ -526,37 +501,10 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 It's possible to backtrack into a recursed or repeated group.

-You can't call a group if there is more than one group with that group name or group number (``"ambiguous group reference"``). For example, ``(?P<foo>\w+) (?P<foo>\w+) (?&foo)?`` has 2 groups called "foo" (both group 1) and ``(?|([A-Z]+)|([0-9]+)) (?1)?`` has 2 groups with group number 1.
+You can't call a group if there is more than one group with that group name or group number (``"ambiguous group reference"``).

 The alternative forms ``(?P>name)`` and ``(?P&name)`` are also supported.

-* repr(regex) doesn't include actual regex (`issue #13592 <https://bugs.python.org/issue13592>`_)
-
-The repr of a compiled regex is now in the form of a eval-able string. For example:
-
-.. sourcecode:: python
-
->>> r = regex.compile("foo", regex.I)
->>> repr(r)
-"regex.Regex('foo', flags=regex.I | regex.V0)"
->>> r
-regex.Regex('foo', flags=regex.I | regex.V0)
-
-The regex module has Regex as an alias for the 'compile' function.
-
-* Improve the repr for regular expression match objects (`issue #17087 <https://bugs.python.org/issue17087>`_)
-
-The repr of a match object is now a more useful form. For example:
-
-.. sourcecode:: python
-
->>> regex.search(r"\d+", "abc012def")
-<regex.Match object; span=(3, 6), match='012'>
-
-* Python lib re cannot handle Unicode properly due to narrow/wide bug (`issue #12729 <https://bugs.python.org/issue12729>`_)
-
-The source code of the regex module has been updated to support PEP 393 ("Flexible String Representation"), which is new in Python 3.3.
-
 * Full Unicode case-folding is supported.

 In version 1 behaviour, the regex module uses full case-folding when performing case-insensitive matches in Unicode.
@@ -608,16 +556,10 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 In the following examples I'll omit the item and write only the fuzziness:

-* ``{i<=3}`` permit at most 3 insertions, but no other types
-
 * ``{d<=3}`` permit at most 3 deletions, but no other types

-* ``{s<=3}`` permit at most 3 substitutions, but no other types
-
 * ``{i<=1,s<=2}`` permit at most 1 insertion and at most 2 substitutions, but no deletions

-* ``{e<=3}`` permit at most 3 errors
-
 * ``{1<=e<=3}`` permit at least 1 and at most 3 errors

 * ``{i<=2,d<=2,e<=3}`` permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions
@@ -630,25 +572,19 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 * ``{i<=1,d<=1,s<=1,2i+2d+1s<=4}`` at most 1 insertion, at most 1 deletion, at most 1 substitution; each insertion costs 2, each deletion costs 2, each substitution costs 1, the total cost must not exceed 4

-You can also use "<" instead of "<=" if you want an exclusive minimum or maximum:
-
-* ``{e<=3}`` permit up to 3 errors
-
-* ``{e<4}`` permit fewer than 4 errors
-
-* ``{0<e<4}`` permit more than 0 but fewer than 4 errors
+You can also use "<" instead of "<=" if you want an exclusive minimum or maximum.

 By default, fuzzy matching searches for the first match that meets the given constraints. The ``ENHANCEMATCH`` flag will cause it to attempt to improve the fit (i.e. reduce the number of errors) of the match that it has found.

 The ``BESTMATCH`` flag will make it search for the best match instead.

 Further examples to note:

-* ``regex.search("(dog){e}", "cat and dog")[1]`` returns ``"cat"`` because that matches ``"dog"`` with 3 errors, which is within the limit (an unlimited number of errors is permitted).
+* ``regex.search("(dog){e}", "cat and dog")[1]`` returns ``"cat"`` because that matches ``"dog"`` with 3 errors (an unlimited number of errors is permitted).

-* ``regex.search("(dog){e<=1}", "cat and dog")[1]`` returns ``" dog"`` (with a leading space) because that matches ``"dog"`` with 1 error, which is within the limit (1 error is permitted).
+* ``regex.search("(dog){e<=1}", "cat and dog")[1]`` returns ``" dog"`` (with a leading space) because that matches ``"dog"`` with 1 error, which is within the limit.

-* ``regex.search("(?e)(dog){e<=1}", "cat and dog")[1]`` returns ``"dog"`` (without a leading space) because the fuzzy search matches ``" dog"`` with 1 error, which is within the limit (1 error is permitted), and the ``(?e)`` then makes it attempt a better fit.
+* ``regex.search("(?e)(dog){e<=1}", "cat and dog")[1]`` returns ``"dog"`` (without a leading space) because the fuzzy search matches ``" dog"`` with 1 error, which is within the limit, and the ``(?e)`` then it attempts a better fit.

 In the first two examples there are perfect matches later in the string, but in neither case is it the first possible match.

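As a reading aid for the three ``(dog)`` examples in the hunk above: the error counts they quote can be reproduced with an ordinary edit-distance calculation. The sketch below uses only plain Python and is not how the regex module implements fuzzy matching; ``edit_distance`` is a hypothetical helper written for illustration, and it assumes every insertion, deletion and substitution costs 1.

```python
def edit_distance(candidate, target):
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn `candidate` into `target`."""
    # previous[j] is the distance between the already-processed prefix
    # of `candidate` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, c in enumerate(candidate, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # drop c from candidate
                current[j - 1] + 1,          # add t to candidate
                previous[j - 1] + (c != t),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("cat", "dog"))   # 3
print(edit_distance(" dog", "dog"))  # 1
print(edit_distance("dog", "dog"))   # 0
```

``"cat"`` is three substitutions away from ``"dog"``, and ``" dog"`` is one extra character away, which matches the counts given in the README text.
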
@@ -716,7 +652,7 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 >>> p = regex.compile(r"first|second|third|fourth|fifth")

-but if the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped if they contain any character that has a special meaning in a regex, and that if there is a shorter string that occurs initially in a longer string that the longer string is listed before the shorter one, for example, "cats" before "cat".
+but if the list is large, parsing the resulting regex can take considerable time, and care must also be taken that the strings are properly escaped and properly ordered, for example, "cats" before "cat".

 The new alternative is to use a named list:

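A note on the escaping and ordering caveat in the hunk above (this illustrates the manual alternation approach, not the named-list feature): the sketch below uses only the standard library ``re`` module. ``build_alternation`` is a hypothetical helper, and sorting by length is just one simple way to make sure a longer string is listed before any shorter string that it starts with.

```python
import re


def build_alternation(words):
    """Compile a pattern that matches any of `words` literally."""
    # Longest first, so that "cats" is tried before its prefix "cat";
    # ties are broken alphabetically to keep the pattern deterministic.
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    # re.escape neutralises characters such as "." that would otherwise
    # have a special meaning in the pattern.
    return re.compile("|".join(re.escape(w) for w in ordered))


pattern = build_alternation(["cat", "cats", "c.t"])
print(pattern.pattern)                # cats|c\.t|cat
print(pattern.match("cats").group())  # cats
print(pattern.match("cut"))           # None
```

Without the length ordering, ``cat|cats`` would match only ``"cat"`` at the start of ``"cats"``; without escaping, ``c.t`` would also match ``"cut"``.
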
@@ -745,7 +681,7 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 * Unicode line separators

-Normally the only line separator is ``\n`` (``\x0A``), but if the ``WORD`` flag is turned on then the line separators are the pair ``\x0D\x0A``, and ``\x0A``, ``\x0B``, ``\x0C`` and ``\x0D``, plus ``\x85``, ``\u2028`` and ``\u2029`` when working with Unicode.
+Normally the only line separator is ``\n`` (``\x0A``), but if the ``WORD`` flag is turned on then the line separators are ``\x0D\x0A``, ``\x0A``, ``\x0B``, ``\x0C`` and ``\x0D``, plus ``\x85``, ``\u2028`` and ``\u2029`` when working with Unicode.

 This affects the regex dot ``"."``, which, with the ``DOTALL`` flag turned off, matches any character except a line separator. It also affects the line anchors ``^`` and ``$`` (in multiline mode).

@@ -791,8 +727,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 .. sourcecode:: python

->>> regex.escape("foo!?")
-'foo!\\?'
 >>> regex.escape("foo!?", special_only=False)
 'foo\\!\\?'
 >>> regex.escape("foo!?", special_only=True)
@@ -806,8 +740,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 .. sourcecode:: python

->>> regex.escape("foo bar!?")
-'foo\\ bar!\\?'
 >>> regex.escape("foo bar!?", literal_spaces=False)
 'foo\\ bar!\\?'
 >>> regex.escape("foo bar!?", literal_spaces=True)
@@ -873,42 +805,14 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 The flags will apply only to the subpattern. Flags can be turned on or off.

-* Inline flags (`issue #433024 <https://bugs.python.org/issue433024>`_, `issue #433027 <https://bugs.python.org/issue433027>`_)
-
-``(?flags-flags)``
-
-Version 0 behaviour: the flags apply to the entire pattern, and they can't be turned off.
-
-Version 1 behaviour: the flags apply to the end of the group or pattern, and they can be turned on or off.
-
-* Repeated repeats (`issue #2537 <https://bugs.python.org/issue2537>`_)
-
-A regex like ``((x|y+)*)*`` will be accepted and will work correctly, but should complete more quickly.
-
 * Definition of 'word' character (`issue #1693050 <https://bugs.python.org/issue1693050>`_)

-The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at ``http://www.unicode.org/reports/tr29/``. This applies to ``\w``, ``\W``, ``\b`` and ``\B``.
-
-* Groups in lookahead and lookbehind (`issue #814253 <https://bugs.python.org/issue814253>`_)
-
-Groups and group references are permitted in both lookahead and lookbehind.
+The definition of a 'word' character has been expanded for Unicode. It now conforms to the Unicode specification at ``http://www.unicode.org/reports/tr29/``.

 * Variable-length lookbehind

 A lookbehind can match a variable-length string.

-* Correct handling of charset with ignore case flag (`issue #3511 <https://bugs.python.org/issue3511>`_)
-
-Ranges within charsets are handled correctly when the ignore-case flag is turned on.
-
-* Unmatched group in replacement (`issue #1519638 <https://bugs.python.org/issue1519638>`_)
-
-An unmatched group is treated as an empty string in a replacement template.
-
-* 'Pathological' patterns (`issue #1566086 <https://bugs.python.org/issue1566086>`_, `issue #1662581 <https://bugs.python.org/issue1662581>`_, `issue #1448325 <https://bugs.python.org/issue1448325>`_, `issue #1721518 <https://bugs.python.org/issue1721518>`_, `issue #1297193 <https://bugs.python.org/issue1297193>`_)
-
-'Pathological' patterns should complete more quickly.
-
 * Flags argument for regex.split, regex.sub and regex.subn (`issue #3482 <https://bugs.python.org/issue3482>`_)

 ``regex.split``, ``regex.sub`` and ``regex.subn`` support a 'flags' argument.
@@ -921,24 +825,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 ``regex.findall`` and ``regex.finditer`` support an 'overlapped' flag which permits overlapped matches.

-* Unicode escapes (`issue #3665 <https://bugs.python.org/issue3665>`_)
-
-The Unicode escapes ``\uxxxx`` and ``\Uxxxxxxxx`` are supported.
-
-* Large patterns (`issue #1160 <https://bugs.python.org/issue1160>`_)
-
-Patterns can be much larger.
-
-* Zero-width match with regex.finditer (`issue #1647489 <https://bugs.python.org/issue1647489>`_)
-
-``regex.finditer`` behaves correctly when it splits at a zero-width match.
-
-* Zero-width split with regex.split (`issue #3262 <https://bugs.python.org/issue3262>`_)
-
-Version 0 behaviour: same as re module (no split before Python 3.7).
-
-Version 1 behaviour: a string can be split at a zero-width match.
-
 * Splititer

 ``regex.splititer`` has been added. It's a generator equivalent of ``regex.split``.
@@ -952,10 +838,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i
 >>> m = regex.search(r"(?P<before>.*?)(?P<num>\d+)(?P<after>.*)", "pqr123stu")
 >>> print(m["before"])
 pqr
->>> print(m["num"])
-123
->>> print(m["after"])
-stu
 >>> print(len(m))
 4
 >>> print(m[:])
@@ -985,8 +867,6 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 * ``Latin``, the 'Latin' script (``Script=Latin``).

-* ``Cyrillic``, the 'Cyrillic' script (``Script=Cyrillic``).
-
 * ``BasicLatin``, the 'BasicLatin' block (``Block=BasicLatin``).

 * ``Alphabetic``, the 'Alphabetic' binary property (``Alphabetic=Yes``).
@@ -995,16 +875,12 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i

 * ``IsLatin``, the 'Latin' script (``Script=Latin``).

-* ``IsCyrillic``, the 'Cyrillic' script (``Script=Cyrillic``).
-
 * ``IsAlphabetic``, the 'Alphabetic' binary property (``Alphabetic=Yes``).

 A short form starting with ``In`` indicates a block property:

 * ``InBasicLatin``, the 'BasicLatin' block (``Block=BasicLatin``).

-* ``InCyrillic``, the 'Cyrillic' block (``Block=Cyrillic``).
-
 * POSIX character classes

 ``[[:alpha:]]``; ``[[:^alpha:]]``
@@ -1088,9 +964,3 @@ The issue numbers relate to the Python bug tracker, except where listed as "Hg i
 * Default Unicode word boundary

 The ``WORD`` flag changes the definition of a 'word boundary' to that of a default Unicode word boundary. This applies to ``\b`` and ``\B``.
-
-* SRE engine do not release the GIL (`issue #1366311 <https://bugs.python.org/issue1366311>`_)
-
-The regex module can release the GIL during matching (see the above section on multithreading).
-
-Iterators can be safely shared across threads.
