Skip to content

Short circuit when functions are syntactically equal #259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

DieracDelta
Copy link
Contributor

This PR adds in a check for syntactic equality (enabled with --syntax-equality ). Example usage may be seen in run_wp_*.sh scripts in the syntax_equality test folder. This PR closes #211, and is a first step towards the ideas in #254. Specifically,
equality.ml exports an exist_isomorphism function. This function checks if there's a mapping between the two sub.t functions that preserves syntax structure, use of virtual variables (virtual variables may go by different names but must be used in all the same places), and control flow.

I've documented the code in equality.ml and equality.mli. I'm also including the results I posted to the slack for the run on grammatech. On average, only running the syntax check takes 8 seconds on my laptop. I found 65 syntactically equal functions on grammatech. When running CBAT, I got:

  • On the master branch cbat without any special flags: 36 SATS, 3 UNKNOWN, 26 UNSAT
  • On the master branch cbat with mem-offset and pointer flag on all registers: 0 SAT, 7 UNKNOWN, and 58 UNSAT

So overall, the results that CBAT finishes within the timeout of 1000s turn out to match with this syntax check. It's not clear to me how to test the UNKNOWNs further.

This first plot is function (on the x axis) vs # syntactically idental blocks / total blocks in the sub. These are encouraging numbers performance improvements in #254 .
image

This second plot is time in seconds on the x axis vs # statements (def_t, jmp_t) on the y axis. The green dots are the run for syntactically equal functions (of which there are 65); the red dots are master branch cbat without any special flags. @philzook58 mentioned that this is big savings if there are a lot of long functions that are syntactically identical in the binary.
image

@DieracDelta DieracDelta force-pushed the jrestivo/syntactic_equality branch from 1885f01 to 6eb4132 Compare October 28, 2020 20:00
@DieracDelta DieracDelta requested a review from fortunac October 28, 2020 20:03
@philzook58
Copy link
Contributor

We may want equality.ml in bap_wp not plugin.

(* some of these pattern matchings are massive; we don't want fragile warnings *)
[@@@warning "-4"]

module H = Bap.Std.Graphs.Ir
Copy link
Contributor

@philzook58 philzook58 Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a logic to these names? What is H? Maybe Ir would be a better choice if it doesn't clash with anything

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm not really. I'll swap it out for something a bit more descriptive (Ir if I can).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can, I think module names shadow in OCaml (sometimes you need to be careful).

module BFS = Graph.Traverse.Bfs (J)
module TidSet = Tid.Set

module TidMap = Tid.Map
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stylistically, I don't think you're saving much with these shorthands.

type varToVarMap = Var.t VarMap.t

(* Replaces the real variables with indices with only their base *)
let rec strip_indices (e : Exp.t) : Exp.t =
Copy link
Contributor

@philzook58 philzook58 Nov 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be expressed more succinctly using the mapper interface. I think this is mapping Var.base over Var constructors and that's it? http://binaryanalysisplatform.github.io/bap/api/odoc/bap/Bap/Std/Exp/class-mapper/index.html

(cfg2 : Bap.Std.Graphs.Ir.t) : ((blk term) TidMap.t) * ((blk term) TidMap.t) =
let acc = fun ele tid_map ->
let tid = Term.tid (get_label ele) in
TidMap.change tid_map tid ~f:(fun _ -> Some (get_label ele)) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constructed block mapping must match the corresponding TIDS
in the same respective jmp_t in sub2.
*)
val exist_isomorphism : Sub.t -> Sub.t -> bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to consider, I can see it either way. Since you find an isomorphism, it seems it might be useful to expose it via returning iso option rather than bool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: why isn't exists? Or even better, get_iso?


(* Iterate through all reachable nodes in each cfg
* and generate a map from tid to blk.*)
let get_node_maps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the two cfgs and maps are built entirely independently, couldn't one make a get_node_map : Ir.t -> blk TidMap.t function that you call twice instead?

module VarSet = Var.Set
module TidTupleMap = Map.Make(struct type t = Tid.t * Tid.t [@@deriving sexp, compare] end)

type varToVarMap = Var.t VarMap.t
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer you not define this type synonym. Var.t VarMap.t is clearer and more self explanatory than varToVarMap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also, var is an alias to Var.t if that helps)

a map from (tid_blk_sub1, tid_blk_sub2) to the mapping between variables
used in those subs. *)
let compare_blocks (sub1: Sub.t) (sub2 : Sub.t) :
(TidSet.t TidMap.t) * (varToVarMap TidTupleMap.t) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a suspicion that it would be better to use ((Var.t Var.Map.t) Tid.Map.t) Tid.Map.t rather than varToVarMap TidTupleMap.t. Essentially, currying the tid keys in that dictionary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun stuff.

let blk_varmap = TidTupleMap.empty in
let evaluator = (fun (node1 : Bap.Std.Graphs.Ir.Node.t) (blk_map, blk_varmap) ->
let blk1 = get_label node1 in
let tid1 = Term.tid (get_label node1) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be blk1 instead of repeating (get_label node1)

@DieracDelta
Copy link
Contributor Author

DieracDelta commented Nov 6, 2020

Thanks for the suggestions! I've made corresponding changes.

I agree--returning the mapping allows for more flexibility. I think the currying suggestion is clever--it makes exposing functions with aliased types easier to include in the mli file. Specifically, this lets me also expose compare_blocks_syntax. This function returns the syntactically equal (up to TID) blocks. Seems useful for #254 .

This mapper interface is awesome. I'm wondering if there's an easy way to use it to rewrite sub_exp in a cleaner fashion that doesn't involve fragile pattern matching. The issue is we need to carry around the information from both subs at the same time, and return something different if the subs don't match. This creates differing type signatures for the declared functions.

Also, I'm making liberal use of >>= and >>|. I could use let syntax instead if that's clearer. Seems like the let syntax is used in bildb, but the inline operators are used in wp. I'm not sure what's preferred here.

I'm testing the changes by running them on grammatech. I'll post again when I've got results. EDIT: has the same behavior as before (marks 65 of the functions as syntactically equal).

@codyroux
Copy link
Contributor

codyroux commented Nov 6, 2020

There's no real norm between using monadic lets and >>=s, though Ivan prefers the latter, and his opinion has some weight.

Apparently, using the ppx versions of let bindings is deprecated, instead using the (newish) feature that allows defining new let-like keywords.

I'd suggest just using the combinators for now.

| None ->
blk_map, blk_varmap) in
let blk_map, blk_varmap = BFS.fold inner_evaluator (blk_map, blk_varmap) cfg2 in
blk_map, blk_varmap) in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this tuple in parens

(* Compares everything about a block EXCEPT tids, which will be done later. *)
let compare_block (blk1 : blk term) (blk2 : blk term) map =
compare_defs (Term.enum def_t blk1 |> Sequence.to_list)
(Term.enum def_t blk2 |> Sequence.to_list) map >>=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to convert to lists?

(* Maps all virtual variables from e1 to their analagous variable within vmap;
if a variable not in vmap, is added to vmap from e2.
Returns None if cannot map variables because subs do not match in structure *)
let rec sub_exp (vmap : Var.t Var.Map.t)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd separate this out into two functions. One that performs substitution and one that compare_exp

let cfg1, cfg2 = Sub.to_cfg sub1, Sub.to_cfg sub2 in
let tid_to_blk1, tid_to_blk2 = get_node_map cfg1, get_node_map cfg2 in
let tid_map, var_maps = compare_blocks_syntax sub1 sub2 in
let used_set = Tid.Set.empty in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move into get_isomorphism

(tid_to_blk1 : (blk term) Tid.Map.t) (tid_to_blk2 : (blk term) Tid.Map.t)
(var_maps : ((Var.t Var.Map.t) Tid.Map.t) Tid.Map.t) (merged_map : Var.t Var.Map.t) :
(Tid.t Tid.Map.t) option =
let node = List.hd node_stack in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fold over the node_list I think

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Special Case Syntactically Equal Functions
3 participants