-
Notifications
You must be signed in to change notification settings - Fork 14
Short circuit when functions are syntactically equal #259
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Short circuit when functions are syntactically equal #259
Conversation
1885f01
to
6eb4132
Compare
We may want equality.ml in |
wp/plugin/lib/equality.ml
Outdated
(* some of these pattern matchings are massive; we don't want fragile warnings *) | ||
[@@@warning "-4"] | ||
|
||
module H = Bap.Std.Graphs.Ir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a logic to these names? What is H? Maybe Ir
would be a better choice if it doesn't clash with anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm not really. I'll swap it out for something a bit more descriptive (Ir
if I can).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can, I think module names shadow in OCaml (sometimes you need to be careful).
wp/plugin/lib/equality.ml
Outdated
module BFS = Graph.Traverse.Bfs (J) | ||
module TidSet = Tid.Set | ||
|
||
module TidMap = Tid.Map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stylistically, I don't think you're saving much with these shorthands.
wp/plugin/lib/equality.ml
Outdated
type varToVarMap = Var.t VarMap.t | ||
|
||
(* Replaces the real variables with indices with only their base *) | ||
let rec strip_indices (e : Exp.t) : Exp.t = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be expressed more succinctly using the mapper
interface. I think this is mapping Var.base over Var constructors and that's it? http://binaryanalysisplatform.github.io/bap/api/odoc/bap/Bap/Std/Exp/class-mapper/index.html
wp/plugin/lib/equality.ml
Outdated
(cfg2 : Bap.Std.Graphs.Ir.t) : ((blk term) TidMap.t) * ((blk term) TidMap.t) = | ||
let acc = fun ele tid_map -> | ||
let tid = Term.tid (get_label ele) in | ||
TidMap.change tid_map tid ~f:(fun _ -> Some (get_label ele)) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http://binaryanalysisplatform.github.io/bap/api/odoc/bap/Bap/Std/Tid/Map/index.html#val-update TidMap.update seems more appropriate
wp/plugin/lib/equality.mli
Outdated
constructed block mapping must match the corresponding TIDS | ||
in the same respective jmp_t in sub2. | ||
*) | ||
val exist_isomorphism : Sub.t -> Sub.t -> bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something to consider, I can see it either way. Since you find an isomorphism, it seems it might be useful to expose it via returning iso option
rather than bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: why isn't exists
? Or even better, get_iso
?
wp/plugin/lib/equality.ml
Outdated
|
||
(* Iterate through all reachable nodes in each cfg | ||
* and generate a map from tid to blk.*) | ||
let get_node_maps |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the two cfgs and maps are built entirely independently, couldn't one make a get_node_map : Ir.t -> blk TidMap.t
function that you call twice instead?
wp/plugin/lib/equality.ml
Outdated
module VarSet = Var.Set | ||
module TidTupleMap = Map.Make(struct type t = Tid.t * Tid.t [@@deriving sexp, compare] end) | ||
|
||
type varToVarMap = Var.t VarMap.t |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer you not define this type synonym. Var.t VarMap.t
is clearer and more self explanatory than varToVarMap
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, var
is an alias to Var.t
if that helps)
wp/plugin/lib/equality.ml
Outdated
a map from (tid_blk_sub1, tid_blk_sub2) to the mapping between variables | ||
used in those subs. *) | ||
let compare_blocks (sub1: Sub.t) (sub2 : Sub.t) : | ||
(TidSet.t TidMap.t) * (varToVarMap TidTupleMap.t) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a suspicion that it would be better to use ((Var.t Var.Map.t) Tid.Map.t) Tid.Map.t
rather than varToVarMap TidTupleMap.t
. Essentially, currying the tid keys in that dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fun stuff.
wp/plugin/lib/equality.ml
Outdated
let blk_varmap = TidTupleMap.empty in | ||
let evaluator = (fun (node1 : Bap.Std.Graphs.Ir.Node.t) (blk_map, blk_varmap) -> | ||
let blk1 = get_label node1 in | ||
let tid1 = Term.tid (get_label node1) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could be blk1 instead of repeating (get_label node1)
Thanks for the suggestions! I've made corresponding changes. I agree--returning the mapping allows for more flexibility. I think the currying suggestion is clever--it makes exposing functions with aliased types easier to include in the This Also, I'm making liberal use of I'm testing the changes by running them on grammatech. I'll post again when I've got results. EDIT: has the same behavior as before (marks 65 of the functions as syntactically equal). |
There's no real norm between using monadic lets and Apparently, using the ppx versions of let bindings is deprecated, instead using the (newish) feature that allows defining new I'd suggest just using the combinators for now. |
| None -> | ||
blk_map, blk_varmap) in | ||
let blk_map, blk_varmap = BFS.fold inner_evaluator (blk_map, blk_varmap) cfg2 in | ||
blk_map, blk_varmap) in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put this tuple in parens
(* Compares everything about a block EXCEPT tids, which will be done later. *) | ||
let compare_block (blk1 : blk term) (blk2 : blk term) map = | ||
compare_defs (Term.enum def_t blk1 |> Sequence.to_list) | ||
(Term.enum def_t blk2 |> Sequence.to_list) map >>= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to convert to lists?
(* Maps all virtual variables from e1 to their analagous variable within vmap; | ||
if a variable not in vmap, is added to vmap from e2. | ||
Returns None if cannot map variables because subs do not match in structure *) | ||
let rec sub_exp (vmap : Var.t Var.Map.t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd separate this out into two functions. One that performs substitution and one that compare_exp
let cfg1, cfg2 = Sub.to_cfg sub1, Sub.to_cfg sub2 in | ||
let tid_to_blk1, tid_to_blk2 = get_node_map cfg1, get_node_map cfg2 in | ||
let tid_map, var_maps = compare_blocks_syntax sub1 sub2 in | ||
let used_set = Tid.Set.empty in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move into get_isomorphism
(tid_to_blk1 : (blk term) Tid.Map.t) (tid_to_blk2 : (blk term) Tid.Map.t) | ||
(var_maps : ((Var.t Var.Map.t) Tid.Map.t) Tid.Map.t) (merged_map : Var.t Var.Map.t) : | ||
(Tid.t Tid.Map.t) option = | ||
let node = List.hd node_stack in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fold over the node_list I think
This PR adds in a check for syntactic equality (enabled with
--syntax-equality
). Example usage may be seen inrun_wp_*.sh
scripts in thesyntax_equality
test folder. This PR closes #211, and is a first step towards the ideas in #254. Specifically,equality.ml
exports anexist_isomorphism
function. This function checks if there's a mapping between the twosub.t
functions that preserves syntax structure, use of virtual variables (virtual variables may go by different names but must be used in all the same places), and control flow.I've documented the code in
equality.ml
andequality.mli
. I'm also including the results I posted to the slack for the run on grammatech. On average, only running the syntax check takes 8 seconds on my laptop. I found 65 syntactically equal functions on grammatech. When running CBAT, I got:So overall, the results that CBAT finishes within the timeout of 1000s turn out to match with this syntax check. It's not clear to me how to test the UNKNOWNs further.
This first plot is function (on the x axis) vs # syntactically idental blocks / total blocks in the sub. These are encouraging numbers performance improvements in #254 .

This second plot is time in seconds on the x axis vs # statements (def_t, jmp_t) on the y axis. The green dots are the run for syntactically equal functions (of which there are 65); the red dots are master branch cbat without any special flags. @philzook58 mentioned that this is big savings if there are a lot of long functions that are syntactically identical in the binary.
