Commit 993aa0b
i386: Fix some problems in stv cost model
This patch fixes some of the problems with costing in the scalar-to-vector (STV) pass. In particular:

1) The pass uses optimize_insn_for_size_p, which is intended to be used by expanders and splitters and requires the pass to call set_rtl_profile (bb) for the currently processed bb. This is not done, so we get random stale info about the hotness of the insn.

2) Register allocator move costs are all relative to the integer reg-reg move, which has a cost of 2, so they are (except for the size tables and i386) the latency of the instruction multiplied by 2. These costs have been duplicated and are now used in combination with rtx costs, which are all based on COSTS_N_INSNS, which multiplies latency by 4. Some of the vectorizer costing contains COSTS_N_INSNS (move_cost) / 2 to compensate, but some new code does not. This patch adds the missing compensation. Perhaps we should update the cost tables to use COSTS_N_INSNS everywhere, but I think we want to fix the inconsistencies first. Also, the tables would get visually much longer, since we have many move costs and COSTS_N_INSNS is a lot of characters.

3) The variable m, which decides how much to multiply the integer variant (to account for the fact that with -m32 all 64-bit computations need 2 instructions), is declared unsigned, which makes the signed computation of the instruction gain happen in an unsigned type and breaks e.g. for division (see the standalone example after this message).

4) I added integer_to_sse costs, which are currently all duplicates of sse_to_integer. AMD chips are asymmetric and moving in one direction is faster than in the other. I will change the costs incrementally once the vectorizer part is fixed up, too.

There are two failures, gcc.target/i386/minmax-6.c and gcc.target/i386/minmax-7.c. Both test STV on Haswell, which no longer happens since SSE->INT and INT->SSE moves are now more expensive. There is only one instruction to convert:

  Computing gain for chain #1...
    Instruction gain 8 for    11: {r110:SI=smax(r116:SI,0);clobber flags:CC;}
    Instruction conversion gain: 8
    Registers conversion cost: 8    <- this is integer_to_sse and sse_to_integer
    Total gain: 0

The total gain used to be 4, since the patch doubles the conversion costs. According to Agner Fog's tables the cost should be 1 cycle, which is correct here.

The final code generated is:

  vmovd   %esi, %xmm0           * latency 1
  cmpl    %edx, %esi
  je      .L2
  vpxor   %xmm1, %xmm1, %xmm1   * latency 1
  vpmaxsd %xmm1, %xmm0, %xmm0   * latency 1
  vmovd   %xmm0, %eax           * latency 1
  imull   %edx, %eax
  cltq
  movzwl  (%rdi,%rax,2), %eax
  ret

while the unconverted code is:

  cmpl    %edx, %esi
  je      .L2
  xorl    %eax, %eax            * latency 1
  testl   %esi, %esi            * latency 1
  cmovs   %eax, %esi            * latency 2
  imull   %edx, %esi
  movslq  %esi, %rsi
  movzwl  (%rdi,%rsi,2), %eax
  ret

The instructions annotated with latency info are the ones that actually differ. The unconverted code has a sum of latencies of 4 and a real (critical-path) latency of 3. The converted code also has a sum of latencies of 4 and a real latency of 3 (vmovd + vpmaxsd + vmovd). So I do not quite see why it should be a win.

There is also a bug in costing MIN/MAX:

    case ABS:
    case SMAX:
    case SMIN:
    case UMAX:
    case UMIN:
      /* We do not have any conditional move cost, estimate it as a
         reg-reg move.  Comparisons are costed as adds.  */
      igain += m * (COSTS_N_INSNS (2) + ix86_cost->add);
      /* Integer SSE ops are all costed the same.  */
      igain -= ix86_cost->sse_op;
      break;

Here COSTS_N_INSNS (2) is not quite right, since a reg-reg move should be 1 or perhaps 0. For Haswell, cmov really is 2 cycles, but I guess we want to have that in the cost vectors like all other instructions. I am not sure whether this is really a win in this case (the other minmax testcases seem to make sense). I have xfailed it for now and will check whether that affects SPEC on the LNT testers.
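To make point 3) above concrete, here is a minimal standalone sketch (not part of the patch; the cost values are invented and COSTS_N_INSNS is redefined locally, mirroring its definition in gcc/rtl.h) showing how a gain computation that involves a division goes wrong when m is unsigned:

  #include <stdio.h>

  #define COSTS_N_INSNS(N) ((N) * 4)     /* same definition as gcc/rtl.h */

  int
  main (void)
  {
    unsigned m = 1;                      /* old declaration in the pass */
    int m_fixed = 1;                     /* declaration after the patch */
    int int_load = 2, sse_load = 6;      /* invented cost-table values */

    /* With unsigned m the subtraction is performed in unsigned arithmetic,
       so a negative gain wraps to a huge positive value before the
       division by 2; the result never becomes negative again.  */
    int broken = COSTS_N_INSNS (m * int_load - sse_load) / 2;
    int fixed  = COSTS_N_INSNS (m_fixed * int_load - sse_load) / 2;

    printf ("broken = %d, fixed = %d\n", broken, fixed);   /* fixed is -8 */
    return 0;
  }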
I will proceed with similar fixes on the vectorizer cost side. Sadly those introduce quite a few differences in the testsuite (partly triggered by other costing problems, such as the one for scatter/gather).

gcc/ChangeLog:

	* config/i386/i386-features.cc
	(general_scalar_chain::vector_const_cost): Add BB parameter; handle
	size costs; use COSTS_N_INSNS to compute move costs.
	(general_scalar_chain::compute_convert_gain): Use
	optimize_bb_for_size_p instead of optimize_insn_for_size_p; use
	COSTS_N_INSNS to compute move costs; update calls of
	general_scalar_chain::vector_const_cost; use
	ix86_cost->integer_to_sse.
	(timode_immed_const_gain): Add bb parameter; use
	optimize_bb_for_size_p.
	(timode_scalar_chain::compute_convert_gain): Use
	optimize_bb_for_size_p.
	* config/i386/i386-features.h (class general_scalar_chain): Update
	prototype of vector_const_cost.
	* config/i386/i386.h (struct processor_costs): Add integer_to_sse.
	* config/i386/x86-tune-costs.h (struct processor_costs): Copy
	sse_to_integer to integer_to_sse everywhere.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/minmax-6.c: xfail test that pmax is used.
	* gcc.target/i386/minmax-7.c: xfail test that pmin is used.
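For reference, the COSTS_N_INSNS (...) / 2 compensation that several of the entries above mention works out as follows. This is a standalone illustration of the arithmetic (the latency value is invented), not code from the patch:

  /* RA move-cost tables are normalized so that an integer reg-reg move
     costs 2, i.e. roughly latency * 2, while rtx costs use COSTS_N_INSNS,
     i.e. latency * 4.  Dividing by 2 converts one scale into the other.  */
  #define COSTS_N_INSNS(N) ((N) * 4)     /* as in gcc/rtl.h */

  enum { latency = 3 };                  /* invented instruction latency */
  enum { ra_move_cost = 2 * latency };   /* RA table entry on that scale */

  _Static_assert (COSTS_N_INSNS (ra_move_cost) / 2 == COSTS_N_INSNS (latency),
                  "COSTS_N_INSNS (move_cost) / 2 lands on the rtx-cost scale");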

File tree: 6 files changed, +118 −50 lines

gcc/config/i386/i386-features.cc

Lines changed: 76 additions & 42 deletions
@@ -518,15 +518,17 @@ scalar_chain::build (bitmap candidates, unsigned insn_uid, bitmap disallowed)
    instead of using a scalar one.  */
 
 int
-general_scalar_chain::vector_const_cost (rtx exp)
+general_scalar_chain::vector_const_cost (rtx exp, basic_block bb)
 {
   gcc_assert (CONST_INT_P (exp));
 
   if (standard_sse_constant_p (exp, vmode))
     return ix86_cost->sse_op;
+  if (optimize_bb_for_size_p (bb))
+    return COSTS_N_BYTES (8);
   /* We have separate costs for SImode and DImode, use SImode costs
      for smaller modes.  */
-  return ix86_cost->sse_load[smode == DImode ? 1 : 0];
+  return COSTS_N_INSNS (ix86_cost->sse_load[smode == DImode ? 1 : 0]) / 2;
 }
 
 /* Compute a gain for chain conversion.  */
@@ -547,34 +549,63 @@ general_scalar_chain::compute_convert_gain ()
      smaller modes than SImode the int load/store costs need to be
      adjusted as well.  */
   unsigned sse_cost_idx = smode == DImode ? 1 : 0;
-  unsigned m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
+  int m = smode == DImode ? (TARGET_64BIT ? 1 : 2) : 1;
 
   EXECUTE_IF_SET_IN_BITMAP (insns, 0, insn_uid, bi)
     {
       rtx_insn *insn = DF_INSN_UID_GET (insn_uid)->insn;
       rtx def_set = single_set (insn);
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
+      basic_block bb = BLOCK_FOR_INSN (insn);
       int igain = 0;
 
       if (REG_P (src) && REG_P (dst))
-        igain += 2 * m - ix86_cost->xmm_move;
+        {
+          if (optimize_bb_for_size_p (bb))
+            /* reg-reg move is 2 bytes, while SSE 3.  */
+            igain += COSTS_N_BYTES (2 * m - 3);
+          else
+            /* Move costs are normalized to reg-reg move having cost 2.  */
+            igain += COSTS_N_INSNS (2 * m - ix86_cost->xmm_move) / 2;
+        }
       else if (REG_P (src) && MEM_P (dst))
-        igain
-          += m * ix86_cost->int_store[2] - ix86_cost->sse_store[sse_cost_idx];
+        {
+          if (optimize_bb_for_size_p (bb))
+            /* Integer load/store is 3+ bytes and SSE 4+.  */
+            igain += COSTS_N_BYTES (3 * m - 4);
+          else
+            igain
+              += COSTS_N_INSNS (m * ix86_cost->int_store[2]
+                                - ix86_cost->sse_store[sse_cost_idx]) / 2;
+        }
       else if (MEM_P (src) && REG_P (dst))
-        igain += m * ix86_cost->int_load[2] - ix86_cost->sse_load[sse_cost_idx];
+        {
+          if (optimize_bb_for_size_p (bb))
+            igain += COSTS_N_BYTES (3 * m - 4);
+          else
+            igain += COSTS_N_INSNS (m * ix86_cost->int_load[2]
+                                    - ix86_cost->sse_load[sse_cost_idx]) / 2;
+        }
       else
         {
          /* For operations on memory operands, include the overhead
             of explicit load and store instructions.  */
          if (MEM_P (dst))
-           igain += optimize_insn_for_size_p ()
-                    ? -COSTS_N_BYTES (8)
-                    : (m * (ix86_cost->int_load[2]
-                            + ix86_cost->int_store[2])
-                       - (ix86_cost->sse_load[sse_cost_idx] +
-                          ix86_cost->sse_store[sse_cost_idx]));
+           {
+             if (optimize_bb_for_size_p (bb))
+               /* ??? This probably should account size difference
+                  of SSE and integer load rather than full SSE load.  */
+               igain -= COSTS_N_BYTES (8);
+             else
+               {
+                 int cost = (m * (ix86_cost->int_load[2]
+                                  + ix86_cost->int_store[2])
+                             - (ix86_cost->sse_load[sse_cost_idx] +
+                                ix86_cost->sse_store[sse_cost_idx]));
+                 igain += COSTS_N_INSNS (cost) / 2;
+               }
+           }
 
          switch (GET_CODE (src))
            {
@@ -595,7 +626,7 @@ general_scalar_chain::compute_convert_gain ()
                igain += ix86_cost->shift_const - ix86_cost->sse_op;
 
              if (CONST_INT_P (XEXP (src, 0)))
-               igain -= vector_const_cost (XEXP (src, 0));
+               igain -= vector_const_cost (XEXP (src, 0), bb);
              break;
 
            case ROTATE:
@@ -631,16 +662,17 @@ general_scalar_chain::compute_convert_gain ()
                igain += m * ix86_cost->add;
 
              if (CONST_INT_P (XEXP (src, 0)))
-               igain -= vector_const_cost (XEXP (src, 0));
+               igain -= vector_const_cost (XEXP (src, 0), bb);
              if (CONST_INT_P (XEXP (src, 1)))
-               igain -= vector_const_cost (XEXP (src, 1));
+               igain -= vector_const_cost (XEXP (src, 1), bb);
              if (MEM_P (XEXP (src, 1)))
                {
-                 if (optimize_insn_for_size_p ())
+                 if (optimize_bb_for_size_p (bb))
                    igain -= COSTS_N_BYTES (m == 2 ? 3 : 5);
                  else
-                   igain += m * ix86_cost->int_load[2]
-                            - ix86_cost->sse_load[sse_cost_idx];
+                   igain += COSTS_N_INSNS
+                              (m * ix86_cost->int_load[2]
+                               - ix86_cost->sse_load[sse_cost_idx]) / 2;
                }
              break;
 
@@ -698,7 +730,7 @@ general_scalar_chain::compute_convert_gain ()
            case CONST_INT:
              if (REG_P (dst))
                {
-                 if (optimize_insn_for_size_p ())
+                 if (optimize_bb_for_size_p (bb))
                    {
                      /* xor (2 bytes) vs. xorps (3 bytes).  */
                      if (src == const0_rtx)
@@ -722,28 +754,29 @@ general_scalar_chain::compute_convert_gain ()
                      /* DImode can be immediate for TARGET_64BIT
                         and SImode always.  */
                      igain += m * COSTS_N_INSNS (1);
-                     igain -= vector_const_cost (src);
+                     igain -= vector_const_cost (src, bb);
                    }
                }
              else if (MEM_P (dst))
                {
                  igain += (m * ix86_cost->int_store[2]
                            - ix86_cost->sse_store[sse_cost_idx]);
-                 igain -= vector_const_cost (src);
+                 igain -= vector_const_cost (src, bb);
                }
              break;
 
            case VEC_SELECT:
              if (XVECEXP (XEXP (src, 1), 0, 0) == const0_rtx)
                {
                  // movd (4 bytes) replaced with movdqa (4 bytes).
-                 if (!optimize_insn_for_size_p ())
-                   igain += ix86_cost->sse_to_integer - ix86_cost->xmm_move;
+                 if (!optimize_bb_for_size_p (bb))
+                   igain += COSTS_N_INSNS (ix86_cost->sse_to_integer
+                                           - ix86_cost->xmm_move) / 2;
                }
              else
                {
                  // pshufd; movd replaced with pshufd.
-                 if (optimize_insn_for_size_p ())
+                 if (optimize_bb_for_size_p (bb))
                    igain += COSTS_N_BYTES (4);
                  else
                    igain += ix86_cost->sse_to_integer;
@@ -769,11 +802,11 @@
   /* Cost the integer to sse and sse to integer moves.  */
   if (!optimize_function_for_size_p (cfun))
     {
-      cost += n_sse_to_integer * ix86_cost->sse_to_integer;
+      cost += n_sse_to_integer * COSTS_N_INSNS (ix86_cost->sse_to_integer) / 2;
      /* ??? integer_to_sse but we only have that in the RA cost table.
         Assume sse_to_integer/integer_to_sse are the same which they
         are at the moment.  */
-      cost += n_integer_to_sse * ix86_cost->sse_to_integer;
+      cost += n_integer_to_sse * COSTS_N_INSNS (ix86_cost->integer_to_sse) / 2;
     }
   else if (TARGET_64BIT || smode == SImode)
     {
@@ -1508,13 +1541,13 @@ general_scalar_chain::convert_insn (rtx_insn *insn)
    with numerous special cases.  */
 
 static int
-timode_immed_const_gain (rtx cst)
+timode_immed_const_gain (rtx cst, basic_block bb)
 {
   /* movabsq vs. movabsq+vmovq+vunpacklqdq.  */
   if (CONST_WIDE_INT_P (cst)
       && CONST_WIDE_INT_NUNITS (cst) == 2
       && CONST_WIDE_INT_ELT (cst, 0) == CONST_WIDE_INT_ELT (cst, 1))
-    return optimize_insn_for_size_p () ? -COSTS_N_BYTES (9)
+    return optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (9)
                                        : -COSTS_N_INSNS (2);
   /* 2x movabsq ~ vmovdqa.  */
   return 0;
@@ -1546,33 +1579,34 @@ timode_scalar_chain::compute_convert_gain ()
       rtx src = SET_SRC (def_set);
       rtx dst = SET_DEST (def_set);
       HOST_WIDE_INT op1val;
+      basic_block bb = BLOCK_FOR_INSN (insn);
       int scost, vcost;
       int igain = 0;
 
       switch (GET_CODE (src))
        {
        case REG:
-         if (optimize_insn_for_size_p ())
+         if (optimize_bb_for_size_p (bb))
           igain = MEM_P (dst) ? COSTS_N_BYTES (6) : COSTS_N_BYTES (3);
         else
           igain = COSTS_N_INSNS (1);
         break;
 
        case MEM:
-         igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (7)
+         igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (7)
                                              : COSTS_N_INSNS (1);
         break;
 
        case CONST_INT:
         if (MEM_P (dst)
             && standard_sse_constant_p (src, V1TImode))
-          igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (11) : 1;
+          igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (11) : 1;
         break;
 
        case CONST_WIDE_INT:
         /* 2 x mov vs. vmovdqa.  */
         if (MEM_P (dst))
-          igain = optimize_insn_for_size_p () ? COSTS_N_BYTES (3)
+          igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (3)
                                               : COSTS_N_INSNS (1);
         break;
 
@@ -1587,14 +1621,14 @@ timode_scalar_chain::compute_convert_gain ()
         if (!MEM_P (dst))
           igain = COSTS_N_INSNS (1);
         if (CONST_SCALAR_INT_P (XEXP (src, 1)))
-          igain += timode_immed_const_gain (XEXP (src, 1));
+          igain += timode_immed_const_gain (XEXP (src, 1), bb);
         break;
 
        case ASHIFT:
        case LSHIFTRT:
         /* See ix86_expand_v1ti_shift.  */
         op1val = INTVAL (XEXP (src, 1));
-        if (optimize_insn_for_size_p ())
+        if (optimize_bb_for_size_p (bb))
           {
             if (op1val == 64 || op1val == 65)
               scost = COSTS_N_BYTES (5);
@@ -1628,7 +1662,7 @@ timode_scalar_chain::compute_convert_gain ()
        case ASHIFTRT:
         /* See ix86_expand_v1ti_ashiftrt.  */
         op1val = INTVAL (XEXP (src, 1));
-        if (optimize_insn_for_size_p ())
+        if (optimize_bb_for_size_p (bb))
           {
             if (op1val == 64 || op1val == 127)
               scost = COSTS_N_BYTES (7);
@@ -1706,7 +1740,7 @@ timode_scalar_chain::compute_convert_gain ()
        case ROTATERT:
         /* See ix86_expand_v1ti_rotate.  */
         op1val = INTVAL (XEXP (src, 1));
-        if (optimize_insn_for_size_p ())
+        if (optimize_bb_for_size_p (bb))
           {
             scost = COSTS_N_BYTES (13);
             if ((op1val & 31) == 0)
@@ -1738,16 +1772,16 @@ timode_scalar_chain::compute_convert_gain ()
           {
             if (GET_CODE (XEXP (src, 0)) == AND)
               /* and;and;or (9 bytes) vs. ptest (5 bytes).  */
-              igain = optimize_insn_for_size_p() ? COSTS_N_BYTES (4)
-                                                 : COSTS_N_INSNS (2);
+              igain = optimize_bb_for_size_p (bb) ? COSTS_N_BYTES (4)
+                                                  : COSTS_N_INSNS (2);
             /* or (3 bytes) vs. ptest (5 bytes).  */
-            else if (optimize_insn_for_size_p ())
+            else if (optimize_bb_for_size_p (bb))
               igain = -COSTS_N_BYTES (2);
           }
         else if (XEXP (src, 1) == const1_rtx)
           /* and;cmp -1 (7 bytes) vs. pcmpeqd;pxor;ptest (13 bytes).  */
-          igain = optimize_insn_for_size_p() ? -COSTS_N_BYTES (6)
-                                             : -COSTS_N_INSNS (1);
+          igain = optimize_bb_for_size_p (bb) ? -COSTS_N_BYTES (6)
+                                              : -COSTS_N_INSNS (1);
         break;
 
        default:

gcc/config/i386/i386-features.h

Lines changed: 1 addition & 1 deletion
@@ -188,7 +188,7 @@ class general_scalar_chain : public scalar_chain
 
  private:
   void convert_insn (rtx_insn *insn) final override;
-  int vector_const_cost (rtx exp);
+  int vector_const_cost (rtx exp, basic_block bb);
   rtx convert_rotate (enum rtx_code, rtx op0, rtx op1, rtx_insn *insn);
 };
 

gcc/config/i386/i386.h

Lines changed: 1 addition & 0 deletions
@@ -179,6 +179,7 @@ struct processor_costs {
   const int xmm_move, ymm_move,     /* cost of moving XMM and YMM register.  */
     zmm_move;
   const int sse_to_integer;         /* cost of moving SSE register to integer.  */
+  const int integer_to_sse;         /* cost of moving integer register to SSE.  */
   const int gather_static, gather_per_elt; /* Cost of gather load is computed
                                                as static + per_item * nelts.  */
   const int scatter_static, scatter_per_elt; /* Cost of gather store is
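
The gcc/config/i386/x86-tune-costs.h part of the change is not included in this excerpt. As an illustration only (the struct name and values below are made up, not copied from any real tune table), the pattern applied there is to add an integer_to_sse entry initialized to the same value as sse_to_integer, so that the two can later diverge on asymmetric (e.g. AMD) chips:

  /* Hypothetical excerpt of a cost table showing the new field.  */
  struct processor_costs_excerpt
  {
    const int sse_to_integer;   /* cost of moving SSE register to integer.  */
    const int integer_to_sse;   /* cost of moving integer register to SSE.  */
  };

  static const struct processor_costs_excerpt example_costs = {
    6,                          /* sse_to_integer */
    6,                          /* integer_to_sse: copied from sse_to_integer
                                   for now, per the ChangeLog entry.  */
  };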
