Summary: | Search-Replace: Regular Expression engine fails on zero length matches | ||
---|---|---|---|
Product: | LibreOffice | Reporter: | masz0 |
Component: | LibreOffice | Assignee: | Not Assigned <libreoffice-bugs> |
Status: | NEW --- | ||
Severity: | enhancement | CC: | andreas.heinisch, buzea.bogdan, chris, edier88, heiko.tietze, himajin100000, jim.avera, michael.warner.ut+libreoffice, salek.talangi, xiscofauli |
Priority: | medium | ||
Version: | 3.3.0 release | ||
Hardware: | All | ||
OS: | All | ||
See Also: | https://bz.apache.org/ooo/show_bug.cgi?id=118887, https://bugs.documentfoundation.org/show_bug.cgi?id=58744, https://bugs.documentfoundation.org/show_bug.cgi?id=38551 | ||
Whiteboard: | |||
Crash report or crash signature: | Regression By: | ||
Bug Depends on: | |||
Bug Blocks: | 146076 |
Description
masz0
2020-08-07 15:51:56 UTC
I am able to confirm this in:

Version: 6.0.7.3
Build ID: 1:6.0.7-0ubuntu0.18.04.10
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3;
Locale: en-US (en_US.UTF-8); Calc: group

I didn't trace through it while executing, so I may be looking at the wrong place for this particular test case, but core/i18npool/source/search/textsearch.cxx lines 942-952 state explicitly that they are there to ignore zero-length matches. The specific comment is this:

// #i118887# ignore zero-length matches e.g. "a*" in "bc"

It was a decision made in OpenOffice (I added the link to their bug in the See Also field). So this is intended behavior to avoid the matching-every-position case, not a bug. Whether it should be intended behavior and how to address it is another question. Personally, I tend to think that users searching for regular expressions are knowledgeable about the regex pattern they are providing (or should be) and therefore we should match the pattern as written.

*** Bug 52504 has been marked as a duplicate of this bug. ***

*** Bug 132870 has been marked as a duplicate of this bug. ***

IIUC, the original request was to find digits like ABC1EFG per "\d *". Works for me with and without the code around nStartOfs/nEndOfs returning "Search key not found" for ABC-EFG.

Don't see much benefit from adding a note about zero-length matches to the UI; although it's easy to implement and unobtrusively replacing the "Search key not found" label. Point is that you get the zero result anyway. But no objection to implement this.

(In reply to Heiko Tietze from comment #4)
> IIUC, the original request was to find digits like ABC1EFG per "\d *". Works

If I am not mistaken, "\d *" has a minimum length of one (a single digit), so is not an example of this bug. Trying to match "\d*" instead would have zero length.

(In reply to Michael Warner from comment #5)
> (In reply to Heiko Tietze from comment #4)
> > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
>
> If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> so is not an example of this bug. Trying to match "\d*" instead would have
> zero length.

But searching for "\d*" would match everywhere is not actually that useful. Where allowing zero-length matches would be useful is with anchors like in the original request of this bug or the other ones linked in the see also section.

(In reply to Michael Warner from comment #5)
> (In reply to Heiko Tietze from comment #4)
> > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
>
> If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> so is not an example of this bug. Trying to match "\d*" instead would have
> zero length.

No, "\d *" tries to match for 1 digit, followed by 0+ spaces.

(In reply to masz0 from comment #7)
> (In reply to Michael Warner from comment #5)
> > (In reply to Heiko Tietze from comment #4)
> > > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> >
> > If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> > so is not an example of this bug. Trying to match "\d*" instead would have
> > zero length.
>
> No, "\d *" tries to match for 1 digit, followed by 0+ spaces.

Which is what I was trying to say. At any rate, I don't think it is a valid test case for the bug you reported, please correct me if I am wrong.

(In reply to Michael Warner from comment #8)
> (In reply to masz0 from comment #7)
> > (In reply to Michael Warner from comment #5)
> > > (In reply to Heiko Tietze from comment #4)
> > > > IIUC, the original request was to find digits like ABC1EFG per "\d *". Works
> > >
> > > If I am not mistaken, "\d *" has a minimum length of one (a single digit),
> > > so is not an example of this bug. Trying to match "\d*" instead would have
> > > zero length.
> >
> > No, "\d *" tries to match for 1 digit, followed by 0+ spaces.
>
> Which is what I was trying to say. At any rate, I don't think it is a valid
> test case for the bug you reported, please correct me if I am wrong.

Oh, sorry, I misunderstood.

Affirmative for "\d *" being an invalid test. Since it requires and matches one digit, not having any in the input (ABC-EFG) will make it fail (legitimately; not thru the artificial limitation). If the input does have digits (ABC1EFG), the pattern will match each in turn. The matches will be length 1 (or more where followed one or more spaces) - therefore LO won't discard them.

My problem was specifically about zero-width assertions "(?<=..)", "(?<!..)", "(?=..)", "(?!..)", "^", and combinations of them. Unlike them, standalone "X*" isn't very useful even though it too can be zero-length.

Whatever the best example is, if someone volunteers, the label can be used without deteriorating effect on usability to give feedback.

Dear Michael Warner,

This bug has been in ASSIGNED status for more than 3 months without any activity. Resetting it to NEW.
Please assign it back to yourself if you're still working on this.

*** Bug 145774 has been marked as a duplicate of this bug. ***

*** Bug 145856 has been marked as a duplicate of this bug. ***

Hi, same behaviour here with version 7.2.2.2:

Version: 7.2.2.2
Build ID: 20(Build:2)
CPU threads: 4; OS: Linux 5.14; UI render: default; VCL: gtk3
Locale: en-GB (en_GB.UTF-8); UI: en-US
Calc: threaded

After trying to find matches with regular expression '^' in order to put a single quote at the beginning of each cell, Calc will say that there is no match. That is a wrong behaviour, as ^ is a valid regular expression for matching beginnings of strings.

*** Bug 145774 has been marked as a duplicate of this bug. ***

*** Bug 160118 has been marked as a duplicate of this bug. ***

A code pointer: TextSearch::RESrchFrwrd [1] uses a loop "until there is a valid match", which is not ended for a valid zero-length match. It must be investigated, why these matches are ignored.
[1] https://opengrok.libreoffice.org/xref/core/i18npool/source/search/textsearch.cxx?r=6182f236#922
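
To make the patterns discussed in this thread concrete, here is a minimal sketch in Python. It is not LibreOffice code and does not reproduce textsearch.cxx: the helper `match_spans` and its `keep_empty` flag are hypothetical names introduced only for illustration, and Python's `re` module stands in for the regex engine LibreOffice actually uses, so edge-case behaviour may differ. The sketch only shows which patterns produce zero-length matches and what is lost when such matches are filtered out.

```python
import re


def match_spans(pattern, text, keep_empty):
    """Return the (start, end) span of every match of pattern in text.

    keep_empty=False drops zero-length matches. This is a hypothetical
    model of the "ignore zero-length matches" rule discussed above,
    not a port of the LibreOffice implementation.
    """
    return [
        m.span()
        for m in re.finditer(pattern, text)
        if keep_empty or m.end() > m.start()
    ]


# "\d *" requires one digit, so its matches are never zero-length and
# the filter does not affect them.
digit_then_spaces = match_spans(r"\d *", "ABC1EFG", keep_empty=False)
no_digit = match_spans(r"\d *", "ABC-EFG", keep_empty=True)

# "^" and a lookbehind match a position rather than any characters.
anchor_kept = match_spans(r"^", "ABC-EFG", keep_empty=True)
anchor_dropped = match_spans(r"^", "ABC-EFG", keep_empty=False)
lookbehind_kept = match_spans(r"(?<=-)", "ABC-EFG", keep_empty=True)
lookbehind_dropped = match_spans(r"(?<=-)", "ABC-EFG", keep_empty=False)

# "\d*" matches the empty string at every position when there are no
# digits: the matching-every-position case the filter was meant to avoid.
star_kept = match_spans(r"\d*", "ABC-EFG", keep_empty=True)
star_dropped = match_spans(r"\d*", "ABC-EFG", keep_empty=False)

# Replacing the zero-length match of "^" prefixes the text, which is the
# "put a single quote at the beginning of each cell" use case.
quoted = re.sub(r"^", "'", "ABC-EFG")
```

With the filter on, "^" and "(?<=-)" report nothing even though each describes exactly one valid position in "ABC-EFG", while "\d *" is unaffected in both inputs. That matches the distinction drawn in the comments: the filter removes the noisy "\d*" case, but it also removes the anchor and lookaround cases that the report is about.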