Skip to content

Commit cfa1a23

Browse files
borkmannanakryiko
authored andcommitted
bpf: Fix overrunning reservations in ringbuf
The BPF ring buffer internally is implemented as a power-of-2 sized circular buffer, with two logical and ever-increasing counters: consumer_pos is the consumer counter to show which logical position the consumer consumed the data, and producer_pos which is the producer counter denoting the amount of data reserved by all producers. Each time a record is reserved, the producer that "owns" the record will successfully advance producer counter. In user space each time a record is read, the consumer of the data advanced the consumer counter once it finished processing. Both counters are stored in separate pages so that from user space, the producer counter is read-only and the consumer counter is read-write. One aspect that simplifies and thus speeds up the implementation of both producers and consumers is how the data area is mapped twice contiguously back-to-back in the virtual memory, allowing to not take any special measures for samples that have to wrap around at the end of the circular buffer data area, because the next page after the last data page would be first data page again, and thus the sample will still appear completely contiguous in virtual memory. Each record has a struct bpf_ringbuf_hdr { u32 len; u32 pg_off; } header for book-keeping the length and offset, and is inaccessible to the BPF program. Helpers like bpf_ringbuf_reserve() return `(void *)hdr + BPF_RINGBUF_HDR_SZ` for the BPF program to use. Bing-Jhong and Muhammad reported that it is however possible to make a second allocated memory chunk overlapping with the first chunk and as a result, the BPF program is now able to edit first chunk's header. For example, consider the creation of a BPF_MAP_TYPE_RINGBUF map with size of 0x4000. Next, the consumer_pos is modified to 0x3000 /before/ a call to bpf_ringbuf_reserve() is made. This will allocate a chunk A, which is in [0x0,0x3008], and the BPF program is able to edit [0x8,0x3008]. Now, lets allocate a chunk B with size 0x3000. This will succeed because consumer_pos was edited ahead of time to pass the `new_prod_pos - cons_pos > rb->mask` check. Chunk B will be in range [0x3008,0x6010], and the BPF program is able to edit [0x3010,0x6010]. Due to the ring buffer memory layout mentioned earlier, the ranges [0x0,0x4000] and [0x4000,0x8000] point to the same data pages. This means that chunk B at [0x4000,0x4008] is chunk A's header. bpf_ringbuf_submit() / bpf_ringbuf_discard() use the header's pg_off to then locate the bpf_ringbuf itself via bpf_ringbuf_restore_from_rec(). Once chunk B modified chunk A's header, then bpf_ringbuf_commit() refers to the wrong page and could cause a crash. Fix it by calculating the oldest pending_pos and check whether the range from the oldest outstanding record to the newest would span beyond the ring buffer size. If that is the case, then reject the request. We've tested with the ring buffer benchmark in BPF selftests (./benchs/run_bench_ringbufs.sh) before/after the fix and while it seems a bit slower on some benchmarks, it is still not significantly enough to matter. Fixes: 457f443 ("bpf: Implement BPF ring buffer and verifier support for it") Reported-by: Bing-Jhong Billy Jheng <[email protected]> Reported-by: Muhammad Ramdhan <[email protected]> Co-developed-by: Bing-Jhong Billy Jheng <[email protected]> Co-developed-by: Andrii Nakryiko <[email protected]> Signed-off-by: Bing-Jhong Billy Jheng <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
1 parent 2673315 commit cfa1a23

File tree

1 file changed

+25
-6
lines changed

1 file changed

+25
-6
lines changed

kernel/bpf/ringbuf.c

Lines changed: 25 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -51,7 +51,8 @@ struct bpf_ringbuf {
5151
* This prevents a user-space application from modifying the
5252
* position and ruining in-kernel tracking. The permissions of the
5353
* pages depend on who is producing samples: user-space or the
54-
* kernel.
54+
* kernel. Note that the pending counter is placed in the same
55+
* page as the producer, so that it shares the same cache line.
5556
*
5657
* Kernel-producer
5758
* ---------------
@@ -70,6 +71,7 @@ struct bpf_ringbuf {
7071
*/
7172
unsigned long consumer_pos __aligned(PAGE_SIZE);
7273
unsigned long producer_pos __aligned(PAGE_SIZE);
74+
unsigned long pending_pos;
7375
char data[] __aligned(PAGE_SIZE);
7476
};
7577

@@ -179,6 +181,7 @@ static struct bpf_ringbuf *bpf_ringbuf_alloc(size_t data_sz, int numa_node)
179181
rb->mask = data_sz - 1;
180182
rb->consumer_pos = 0;
181183
rb->producer_pos = 0;
184+
rb->pending_pos = 0;
182185

183186
return rb;
184187
}
@@ -404,9 +407,9 @@ bpf_ringbuf_restore_from_rec(struct bpf_ringbuf_hdr *hdr)
404407

405408
static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
406409
{
407-
unsigned long cons_pos, prod_pos, new_prod_pos, flags;
408-
u32 len, pg_off;
410+
unsigned long cons_pos, prod_pos, new_prod_pos, pend_pos, flags;
409411
struct bpf_ringbuf_hdr *hdr;
412+
u32 len, pg_off, tmp_size, hdr_len;
410413

411414
if (unlikely(size > RINGBUF_MAX_RECORD_SZ))
412415
return NULL;
@@ -424,13 +427,29 @@ static void *__bpf_ringbuf_reserve(struct bpf_ringbuf *rb, u64 size)
424427
spin_lock_irqsave(&rb->spinlock, flags);
425428
}
426429

430+
pend_pos = rb->pending_pos;
427431
prod_pos = rb->producer_pos;
428432
new_prod_pos = prod_pos + len;
429433

430-
/* check for out of ringbuf space by ensuring producer position
431-
* doesn't advance more than (ringbuf_size - 1) ahead
434+
while (pend_pos < prod_pos) {
435+
hdr = (void *)rb->data + (pend_pos & rb->mask);
436+
hdr_len = READ_ONCE(hdr->len);
437+
if (hdr_len & BPF_RINGBUF_BUSY_BIT)
438+
break;
439+
tmp_size = hdr_len & ~BPF_RINGBUF_DISCARD_BIT;
440+
tmp_size = round_up(tmp_size + BPF_RINGBUF_HDR_SZ, 8);
441+
pend_pos += tmp_size;
442+
}
443+
rb->pending_pos = pend_pos;
444+
445+
/* check for out of ringbuf space:
446+
* - by ensuring producer position doesn't advance more than
447+
* (ringbuf_size - 1) ahead
448+
* - by ensuring oldest not yet committed record until newest
449+
* record does not span more than (ringbuf_size - 1)
432450
*/
433-
if (new_prod_pos - cons_pos > rb->mask) {
451+
if (new_prod_pos - cons_pos > rb->mask ||
452+
new_prod_pos - pend_pos > rb->mask) {
434453
spin_unlock_irqrestore(&rb->spinlock, flags);
435454
return NULL;
436455
}

0 commit comments

Comments
 (0)