Commit 4896746

Merge #328
328: Optimizing BigUint and BigInt multiplication with the Toom-3 algorithm r=cuviper a=kompass

Hi! I finally implemented the Toom-3 algorithm! I first tried to minimize memory allocations by allocating the `Vec<BigDigit>` myself, as was done for Toom-2, but Toom-3 needs more complex calculations involving negative numbers. So I gave up on that approach and used `BigInt` directly, and it's already faster! I also chose a better threshold for the Toom-2 algorithm.

Before any modification:

```
running 4 tests
test multiply_0 ... bench: 257 ns/iter (+/- 25)
test multiply_1 ... bench: 30,240 ns/iter (+/- 1,651)
test multiply_2 ... bench: 2,752,360 ns/iter (+/- 52,102)
test multiply_3 ... bench: 11,618,575 ns/iter (+/- 266,286)
```

With a better Toom-2 threshold (16 instead of 4):

```
running 4 tests
test multiply_0 ... bench: 130 ns/iter (+/- 8)
test multiply_1 ... bench: 19,772 ns/iter (+/- 1,083)
test multiply_2 ... bench: 1,340,644 ns/iter (+/- 17,987)
test multiply_3 ... bench: 7,302,854 ns/iter (+/- 82,060)
```

With the Toom-3 algorithm (with a threshold of 300):

```
running 4 tests
test multiply_0 ... bench: 123 ns/iter (+/- 3)
test multiply_1 ... bench: 19,689 ns/iter (+/- 837)
test multiply_2 ... bench: 1,189,589 ns/iter (+/- 29,101)
test multiply_3 ... bench: 3,014,225 ns/iter (+/- 61,222)
```

I think this could be optimized further, but it's a first step!
2 parents 8646be5 + 1ddbee7 commit 4896746

2 files changed: +74 -13 lines changed
benches/bigint.rs

Lines changed: 5 additions & 0 deletions
@@ -78,6 +78,11 @@ fn multiply_2(b: &mut Bencher) {
     multiply_bench(b, 1 << 16, 1 << 16);
 }

+#[bench]
+fn multiply_3(b: &mut Bencher) {
+    multiply_bench(b, 1 << 16, 1 << 17);
+}
+
 #[bench]
 fn divide_0(b: &mut Bencher) {
     divide_bench(b, 1 << 8, 1 << 6);

bigint/src/algorithms.rs

Lines changed: 69 additions & 13 deletions
@@ -8,6 +8,7 @@ use traits::{Zero, One};

 use biguint::BigUint;

+use bigint::BigInt;
 use bigint::Sign;
 use bigint::Sign::{Minus, NoSign, Plus};

@@ -225,20 +226,18 @@ fn mac_digit(acc: &mut [BigDigit], b: &[BigDigit], c: BigDigit) {
         return;
     }

-    let mut b_iter = b.iter();
     let mut carry = 0;
+    let (a_lo, a_hi) = acc.split_at_mut(b.len());

-    for ai in acc.iter_mut() {
-        if let Some(bi) = b_iter.next() {
-            *ai = mac_with_carry(*ai, *bi, c, &mut carry);
-        } else if carry != 0 {
-            *ai = mac_with_carry(*ai, 0, c, &mut carry);
-        } else {
-            break;
-        }
+    for (a, &b) in a_lo.iter_mut().zip(b) {
+        *a = mac_with_carry(*a, b, c, &mut carry);
     }

-    assert!(carry == 0);
+    let mut a = a_hi.iter_mut();
+    while carry != 0 {
+        let a = a.next().expect("carry overflow during multiplication!");
+        *a = adc(*a, 0, &mut carry);
+    }
 }

 /// Three argument multiply accumulate:
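
The rewritten `mac_digit` splits the accumulator at `b.len()`: the low part receives the digit-by-digit multiply-accumulate, and the high part only absorbs whatever carry is left. A minimal standalone sketch of that shape follows. It assumes 32-bit digits with 64-bit intermediates, and the `Digit`/`Wide` aliases and the bodies of `mac_with_carry` and `adc` are illustrative stand-ins, since the crate's own definitions are not part of this diff.

```rust
// Hypothetical stand-ins for the crate's digit types (assumed 32-bit digits).
type Digit = u32;
type Wide = u64;

/// Returns the low digit of `a + b * c + carry` and stores the high digit in `carry`.
/// Assumed behavior; the real helper is not shown in this diff.
fn mac_with_carry(a: Digit, b: Digit, c: Digit, carry: &mut Digit) -> Digit {
    // The sum cannot overflow u64: (2^32-1) + (2^32-1)^2 + (2^32-1) = 2^64 - 1.
    let wide = (a as Wide) + (b as Wide) * (c as Wide) + (*carry as Wide);
    *carry = (wide >> 32) as Digit;
    wide as Digit
}

/// Returns the low digit of `a + b + carry` and stores the overflow in `carry`.
/// Assumed behavior; the real helper is not shown in this diff.
fn adc(a: Digit, b: Digit, carry: &mut Digit) -> Digit {
    let wide = (a as Wide) + (b as Wide) + (*carry as Wide);
    *carry = (wide >> 32) as Digit;
    wide as Digit
}

/// Computes `acc += b * c`, following the structure of the new code in the hunk.
fn mac_digit(acc: &mut [Digit], b: &[Digit], c: Digit) {
    if c == 0 {
        return;
    }

    let mut carry = 0;
    // Low part lines up with `b`; high part only needs carry propagation.
    let (a_lo, a_hi) = acc.split_at_mut(b.len());

    for (a, &b) in a_lo.iter_mut().zip(b) {
        *a = mac_with_carry(*a, b, c, &mut carry);
    }

    // Ripple the remaining carry upward, stopping as soon as it is absorbed.
    let mut a = a_hi.iter_mut();
    while carry != 0 {
        let a = a.next().expect("carry overflow during multiplication!");
        *a = adc(*a, 0, &mut carry);
    }
}
```

Calling `mac_digit` on a zeroed four-digit accumulator with `b = [u32::MAX, u32::MAX]` and `c = u32::MAX` leaves the little-endian digits of (2^64 - 1) * (2^32 - 1) in the accumulator. Compared with the removed loop, the zip avoids a per-digit `Option` check, and running out of accumulator space is reported at the point where it happens rather than by a trailing `assert!`.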
@@ -250,13 +249,23 @@ fn mac3(acc: &mut [BigDigit], b: &[BigDigit], c: &[BigDigit]) {
         (c, b)
     };

-    // Karatsuba multiplication is slower than long multiplication for small x and y:
+    // We use three algorithms for different input sizes.
     //
-    if x.len() <= 4 {
+    // - For small inputs, long multiplication is fastest.
+    // - Next we use Karatsuba multiplication (Toom-2), which we have optimized
+    //   to avoid unnecessary allocations for intermediate values.
+    // - For the largest inputs we use Toom-3, which better optimizes the
+    //   number of operations, but uses more temporary allocations.
+    //
+    // The thresholds are somewhat arbitrary, chosen by evaluating the results
+    // of `cargo bench --bench bigint multiply`.
+
+    if x.len() <= 32 {
+        // Long multiplication:
         for (i, xi) in x.iter().enumerate() {
             mac_digit(&mut acc[i..], y, *xi);
         }
-    } else {
+    } else if x.len() <= 256 {
         /*
          * Karatsuba multiplication:
          *
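
The branch structure in this hunk amounts to a three-way dispatch on the digit count of `x`. The sketch below isolates just that decision. The `Algorithm` enum and `choose_algorithm` function are hypothetical names introduced for illustration, and the sketch assumes that `x` is the shorter of the two operands, which the visible `(c, b)` swap suggests but the hunk does not show in full.

```rust
/// Hypothetical label for the routine `mac3` would pick; not a type from the crate.
#[derive(Debug, PartialEq)]
enum Algorithm {
    Long,
    Karatsuba,
    Toom3,
}

/// Picks an algorithm from the operand lengths (in digits), using the
/// thresholds 32 and 256 that appear in the hunk above.
fn choose_algorithm(b_len: usize, c_len: usize) -> Algorithm {
    // Assumption: the decision is made on the shorter operand.
    let x_len = b_len.min(c_len);
    match x_len {
        0..=32 => Algorithm::Long,
        33..=256 => Algorithm::Karatsuba,
        _ => Algorithm::Toom3,
    }
}
```

Under that assumption, multiplying a 32-digit value by a 1000-digit value would still take the long-multiplication branch, because only the shorter length is compared against the thresholds.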
@@ -375,6 +384,53 @@ fn mac3(acc: &mut [BigDigit], b: &[BigDigit], c: &[BigDigit]) {
             },
             NoSign => (),
         }
+
+    } else {
+        // Toom-3 multiplication:
+        //
+        // Toom-3 is like Karatsuba above, but dividing the inputs into three parts.
+        // Both are instances of Toom-Cook, using `k=3` and `k=2` respectively.
+        //
+        // FIXME: It would be nice to have comments breaking down the operations below.
+
+        let i = y.len()/3 + 1;
+
+        let x0_len = cmp::min(x.len(), i);
+        let x1_len = cmp::min(x.len() - x0_len, i);
+
+        let y0_len = i;
+        let y1_len = cmp::min(y.len() - y0_len, i);
+
+        let x0 = BigInt::from_slice(Plus, &x[..x0_len]);
+        let x1 = BigInt::from_slice(Plus, &x[x0_len..x0_len + x1_len]);
+        let x2 = BigInt::from_slice(Plus, &x[x0_len + x1_len..]);
+
+        let y0 = BigInt::from_slice(Plus, &y[..y0_len]);
+        let y1 = BigInt::from_slice(Plus, &y[y0_len..y0_len + y1_len]);
+        let y2 = BigInt::from_slice(Plus, &y[y0_len + y1_len..]);
+
+        let p = &x0 + &x2;
+        let q = &y0 + &y2;
+
+        let p2 = &p - &x1;
+        let q2 = &q - &y1;
+
+        let r0 = &x0 * &y0;
+        let r4 = &x2 * &y2;
+        let r1 = (p + x1) * (q + y1);
+        let r2 = &p2 * &q2;
+        let r3 = ((p2 + x2)*2 - x0) * ((q2 + y2)*2 - y0);
+
+        let mut comp3: BigInt = (r3 - &r1) / 3;
+        let mut comp1: BigInt = (r1 - &r2) / 2;
+        let mut comp2: BigInt = r2 - &r0;
+        comp3 = (&comp2 - comp3)/2 + &r4*2;
+        comp2 = comp2 + &comp1 - &r4;
+        comp1 = comp1 - &comp3;
+
+        let result = r0 + (comp1 << 32*i) + (comp2 << 2*32*i) + (comp3 << 3*32*i) + (r4 << 4*32*i);
+        let result_pos = result.to_biguint().unwrap();
+        add2(&mut acc[..], &result_pos.data);
     }
 }
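
The Toom-3 branch carries a FIXME asking for a breakdown of its operations. One way to read it: each operand is treated as a degree-2 polynomial in `B = 2^(32*i)`, both polynomials are evaluated at 0, 1, -1, -2 and infinity (`r0`, `r1`, `r2`, `r3`, `r4`), the five pointwise products are interpolated back into the five coefficients of the product polynomial, and those are recombined by shifting. The sketch below replays the same sequence on machine integers so it can be checked directly. It is an illustration under stated assumptions, not the crate's code: `toom3_mul` and `LIMB_BITS` are hypothetical names, operands are `u64` values split into 22-bit parts, and `i128` stands in for `BigInt` as the signed intermediate type.

```rust
/// Multiplies two u64 values with a single level of Toom-3, mirroring the
/// evaluation and interpolation sequence in the hunk above.
/// Hypothetical helper for illustration only.
fn toom3_mul(x: u64, y: u64) -> u128 {
    // Three parts of 22 bits cover 64 bits; the top part holds the remaining 20.
    const LIMB_BITS: u32 = 22;
    const MASK: u64 = (1 << LIMB_BITS) - 1;

    // Split v into (v0, v1, v2) with v = v0 + v1*B + v2*B^2, B = 2^LIMB_BITS.
    let split = |v: u64| -> (i128, i128, i128) {
        (
            (v & MASK) as i128,
            ((v >> LIMB_BITS) & MASK) as i128,
            (v >> (2 * LIMB_BITS)) as i128,
        )
    };
    let (x0, x1, x2) = split(x);
    let (y0, y1, y2) = split(y);

    // Evaluation. p2 and q2 are the operands evaluated at -1 and may be
    // negative, which is why signed intermediates are needed.
    let p = x0 + x2;
    let q = y0 + y2;
    let p2 = p - x1;
    let q2 = q - y1;

    let r0 = x0 * y0; // product at 0
    let r4 = x2 * y2; // product at infinity (leading coefficient)
    let r1 = (p + x1) * (q + y1); // product at 1
    let r2 = p2 * q2; // product at -1
    let r3 = ((p2 + x2) * 2 - x0) * ((q2 + y2) * 2 - y0); // product at -2

    // Interpolation. Every division is exact; afterwards comp1, comp2 and
    // comp3 are the coefficients of B, B^2 and B^3 in the product.
    let mut comp3 = (r3 - r1) / 3;
    let mut comp1 = (r1 - r2) / 2;
    let mut comp2 = r2 - r0;
    comp3 = (comp2 - comp3) / 2 + r4 * 2;
    comp2 = comp2 + comp1 - r4;
    comp1 = comp1 - comp3;

    // Recomposition. The coefficients are non-negative here because all
    // parts of the inputs are non-negative, so the casts are lossless.
    let b = LIMB_BITS;
    (r0 as u128)
        + ((comp1 as u128) << b)
        + ((comp2 as u128) << (2 * b))
        + ((comp3 as u128) << (3 * b))
        + ((r4 as u128) << (4 * b))
}
```

For any pair of `u64` inputs the result should equal `(x as u128) * (y as u128)`. The point of the scheme is that five part-sized multiplications replace the nine that a schoolbook product of three-part operands would need; in `mac3` those five products are themselves `BigInt` multiplications, and the shifts use `32*i` bits because each part spans `i` 32-bit digits.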
