buffered output #478

Open
michbsd opened this issue Mar 28, 2025 · 6 comments
Labels
enhancement New feature or request question A question that has or needs further clarification

Comments

@michbsd

michbsd commented Mar 28, 2025

Hi,

Not sure if this is a bug or just me not understanding the configuration options (so bear with me).

ugrep 7.3.0 amd64-portbld-freebsd14.1 +avx512; -P:pcre2jit; -z:zlib,bzip2,lzma,lz4,zstd,brotli,bzip3,7z,tar/pax/cpio/zip

I often need to grep a stream on stdin, e.g. from varnishlog or just a simple tail, and it seems I get my match but not the remainder of the line, e.g.

 varnishlog -q 'ReqURL ~ crap'  | grep --line-buffered Age
    77: -   RespHeader     Age: 0
   168: -   RespHeader     Age: 0
   259: -   RespHeader     Age
^C

As you can see, the last line is not printed in full until another line matches.

I've tried with and without --line-buffered - no change.

@genivia-inc
Member

I had been thinking about improving this to get rid of the artifact, but have not yet found time to work on it. A good thing now is that ugrep "understands" the difference between a line-based regex and a multi-line regex, which is internally used for certain speed optimizations.

The line will be completed, but only when the next match is found or at EOF, whichever comes first. This is an artifact of the built-in multiline matching capability of ugrep. Multiline matching can only work properly if the engine can safely flush the current line and move to the next without skipping a potential match that spans from the current line to the next line and beyond (which may not have been read yet). For example, the pattern birth[ \t\n]+date allows spaces, tabs and newlines between the two words. So with the two input lines my birth date is ... his birth birth\ndate is ... the first line cannot be fully displayed until the next line with date arrives in the buffer. The funny thing is that the regex engine doesn't "know" anything about the output, so it goes off to find the next match when there is no date in the next line, as when the input is my birth date is ... his birth birth\nmonth is ... Therefore, the first line is still not completed after my birth date is ... until the next match is found later.
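To make the point about matches spanning lines concrete, here is a minimal Python sketch. It is not ugrep's engine; it only uses Python's re module with the pattern from the comment above, and the helper names match_per_line and match_whole_buffer are made up for illustration. The input is a simplified version of the example (a single birth before the newline).

```python
import re

# Pattern from the discussion: "birth" and "date" separated by
# spaces, tabs, or newlines.
PATTERN = re.compile(r"birth[ \t\n]+date")

def match_per_line(text):
    """Line-oriented matching: each line is searched in isolation."""
    return [m.group() for line in text.split("\n") for m in PATTERN.finditer(line)]

def match_whole_buffer(text):
    """Buffer-oriented matching: a match may span a line boundary."""
    return [m.group() for m in PATTERN.finditer(text)]

text = "my birth date is ... his birth\ndate is ..."
per_line = match_per_line(text)
whole = match_whole_buffer(text)
print(per_line)  # ['birth date']
print(whole)     # ['birth date', 'birth\ndate']
```

The second match only exists once the second line is in the buffer, which is why a multi-line matcher cannot treat the first line as finished the moment its newline arrives.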

Note that by contrast, GNU grep is a line-oriented matcher, so it consumes line-by-line to find matching lines (in essence).

I don't want to restrict ugrep's --line-buffered to matching single lines only, without support for multi-line matching. It would confuse users if multi-line matching suddenly stopped working when --line-buffered is specified.

@genivia-inc
Member

genivia-inc commented Mar 28, 2025

I should add that option -u (--ungroup) does flush lines immediately with --line-buffered. But a line with multiple matches will then be displayed once for each match separately (which is the purpose of -u).
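As a rough illustration of the difference between grouped and ungrouped output, the following Python sketch emits a matching line either once or once per match. The functions grouped and ungrouped are hypothetical and only model the behaviour described above; the exact output format of ugrep -u is not reproduced here.

```python
import re

def grouped(line, pattern):
    """Default behaviour: a matching line is emitted once,
    no matter how many matches it contains."""
    return [line] if re.search(pattern, line) else []

def ungrouped(line, pattern):
    """Sketch of -u: the line is emitted again for every match,
    so each match can be output without waiting for the next one."""
    return [line for _ in re.finditer(pattern, line)]

line = "Age: 0 Age: 1"
print(grouped(line, r"Age"))    # ['Age: 0 Age: 1']
print(ungrouped(line, r"Age"))  # ['Age: 0 Age: 1', 'Age: 0 Age: 1']
```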

@genivia-inc genivia-inc added question A question that has or needs further clarification enhancement New feature or request labels Mar 29, 2025
@genivia-inc
Member

genivia-inc commented Apr 1, 2025

I'm not satisfied either. This should and can be improved.

At the very least, when a non-matching line arrives in the buffer right after a displayed matching line, the matching line should be displayed completely, i.e. not dangle unfinished until another matching line arrives (much) later, as is the case right now.

This is what I have in mind, hope it is acceptable:

  • without colors enabled, the matching line will be displayed immediately and completely (no delay)
  • with colors enabled, the matching line with a color-highlighted match will be displayed immediately, but is completed when the next line arrives in the buffer (one line delay)
  • with option -u, matching lines are displayed immediately like before, with separate lines for each match (no delay)
  • multi-line pattern matching with patterns containing newlines affects the first point above, in that the matching line can't be completely displayed right away and will behave as in the second point (one line delay)

Note the first point: this does not delay piping through ugrep such as tail -f file.log | ug pattern | more because ug output to a pipe is not color-highlighted.
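The four proposed points can be summarized as a small decision function. This is only a sketch of the proposal as written in this comment, not ugrep code; the name completion_delay and its parameters are hypothetical, and giving -u precedence over a multi-line pattern is an assumption that the comment does not spell out.

```python
def completion_delay(color: bool, ungroup: bool, multiline_pattern: bool) -> str:
    """Return how long a matching line waits before it is fully displayed."""
    if ungroup:
        # -u: one output line per match, displayed at once.
        # Assumption: this takes precedence over the other conditions.
        return "none"
    if color or multiline_pattern:
        # The match is shown immediately, but the rest of the line is
        # completed only when the next line arrives in the buffer.
        return "one line"
    # No color and a single-line pattern: displayed immediately and completely.
    return "none"

print(completion_delay(color=False, ungroup=False, multiline_pattern=False))  # none
print(completion_delay(color=True, ungroup=False, multiline_pattern=False))   # one line
```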

Option --line-buffered has no effect when reading standard input from character devices and pipes. Ugrep detects standard input from character devices and pipes and applies the strategy above to it. It already does so today, but without the proposed improvements.

@genivia-inc
Member

Quick update.

The dev implementation is done, except for refactoring the code, performing additional verification and testing, and addressing one case I am not happy with yet (it does not perfectly follow the proposed points).

Again, we don't need --line-buffered (nor need to improve it) for this use case. It is only needed, and implicitly enabled, by the TUI -Q and by --pager to avoid output delays in the TUI and pager.

Ugrep is also already smart enough to immediately show matches and flush output when matches are made on standard input from a character device or a pipe, e.g. to follow input with tail -f file.log | ug pattern. This is done with non-blocking IO and handlers. This is a lot faster than reading input line-by-line or, worse, byte-by-byte, which would be extremely slow. I'm improving this part of the code to make sure matching lines are shown without unnecessary delay.

@genivia-inc
Member

Still not 100% happy with the ugrep update I'm working on to release. The stdin pipe to ugrep behavior should be the same as GNU grep. That's not yet the case, because ugrep may quit under certain circumstances (e.g. max matches were found), whereas GNU grep keeps draining stdin until EOF (if that ever comes). So let's do the same.

@genivia-inc
Member

Implementation and testing are complete. Results and performance look good.

Input-following, with tail -f file | ugrep pattern for example, is efficient because it uses non-blocking reads. So we don't read byte-by-byte or line-by-line, but rather use a non-blocking read to read as much as possible until the sender waits. Ugrep detects that the sender is waiting when the read returns EAGAIN, and displays the results at that point. This gives the illusion that ugrep follows the input byte-by-byte, but it is much faster using non-blocking reads. Because I'm using non-blocking reads and we now want to display results as early as possible, I had to refactor the SIMD acceleration implementation of the regex engine to make it a little less greedy for input.
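The read-until-EAGAIN idea can be sketched in a few lines of Python. This is an illustration of the general technique on a POSIX system, not ugrep's C++ implementation; the function follow and its callbacks on_data and on_idle are hypothetical names. In Python, a non-blocking read that would fail with EAGAIN raises BlockingIOError.

```python
import os
import select

def follow(fd, on_data, on_idle, chunk_size=65536):
    """Read fd with non-blocking reads until EOF.

    on_data(chunk) is called for every chunk of bytes read.
    on_idle() is called whenever the sender has nothing more to offer
    right now (EAGAIN) and once more at EOF, so pending results can be
    displayed.
    """
    os.set_blocking(fd, False)
    while True:
        try:
            chunk = os.read(fd, chunk_size)
        except BlockingIOError:          # EAGAIN: the sender is waiting
            on_idle()                    # display what we have so far
            select.select([fd], [], [])  # sleep until input or EOF arrives
            continue
        if not chunk:                    # empty read means EOF
            on_idle()
            return
        on_data(chunk)

# Small demonstration with a pipe whose writer has already finished.
r, w = os.pipe()
os.write(w, b"-   RespHeader     Age: 0\n")
os.close(w)
received = []
follow(r, received.append, lambda: None)
os.close(r)
```

Reading in large chunks and flushing only when the read would block keeps the number of system calls low while still showing output as soon as the sender pauses.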

All matching lines are displayed immediately when color is not used. With color, the matching line is completed when the next line arrives in the buffer, as I've described above. The reason is that the ugrep regex engine is not line-based like GNU grep, but is inherently multi-line matching. It would be possible to exactly replicate GNU grep, but this requires separate code just for following the input line-by-line. I think that is overkill, but if necessary it can be done.

Will release an update soon.

2 participants