		TO RUN THE CLONE-DETECTION TOOL
		-------------------------------


Needed files:
	clones-impl/NEW/MakeExprTrees/operators.stk
	clones-impl/NEW/MakeExprTrees/create-sexprs.stk

	clones-impl/NEW/SCHEME-SCRIPTS/name2pdg.stk
    
	clones-impl/NEW/FindClonePairs/reachability/libreachability.so
	clones-impl/NEW/FindClonePairs/back-slice.stk

	clones-impl/NEW/FilterClones/exsize
	clones-impl/NEW/FilterClones/filter-pairs.stk
	clones-impl/NEW/FilterClones/filter-pairs/libfilter_pairs.so

	clones-impl/NEW/GroupClones/sep-sizes
	clones-impl/NEW/GroupClones/group-pairs.stk
	clones-impl/NEW/GroupClones/my-utils.stk
	clones-impl/NEW/GroupClones/group-pairs/libgroup_pairs.so

1. Create a CodeSurfer project.  Use the command line:

   csurf -retain-pdg-vertex-to-ast-mapping yes -user-library-path \
     ${CSURF}/libmodels -cfg-edges both -use-def-sets yes cc -o foo <files>

   I am assuming that the library stubs have been built in the
   directory ${CSURF}/libmodels. The CodeSurfer GUI can be used to
   build the library stubs.
     
2. In the directory that contains the project:
     (a) Make sure that you have copies of operators.stk and
         create-sexprs.stk in that directory (copy them from MakeExprTrees).
     (b) Run create-sexprs:
            csurf -b
	    (load "create-sexprs.stk")
	    (main <input> <output>)
         where <input> is the name of the sdg file in the CSURF.FILES
	 subdirectory (e.g., "test.sdg"), and <output> is the name of
	 the output file (e.g., "test.sexprs").  Note that both file names
	 must be enclosed in quotes.

3. In the same directory:
      (a) Make sure that you have copies of name2pdg.stk,
          libreachability.so, and back-slice.stk (copy them from
	      the locations mentioned earlier in this file).
      (b) Run back-slice.stk:
   	     csurf <project>
	     (load "back-slice.stk")
	     (main <input> <files>)
	     (quit)
          where <input> is the name of the file created by "create-sexprs.stk"
          (e.g., "test.sexprs") and <files> is a list of file names, each in
          double quotes, in which the tool should look for clones. Remember
          to put a tick (') before the list.  EXAMPLE:
             (main "test.sexprs" '("test1.c" "test2.c"))

   A number of useful files, including OUTPUT and num-nodes-STATS, will be
   created.
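
   The tick and the quoting in step (b) are easy to get wrong when the
   list of files is long. As a minimal sketch, the expression to type at
   the STk prompt can be generated from a list of file names. Python is
   used here only for illustration; the helper name format_main_call is
   hypothetical and is not part of the tool.

```python
def format_main_call(sexprs_file, source_files):
    """Build the (main ...) expression to type after loading back-slice.stk.

    Assumes the file names contain no double quotes or backslashes.
    """
    quoted = " ".join('"%s"' % name for name in source_files)
    # The leading tick quotes the list so that STk does not evaluate it.
    return '(main "%s" \'(%s))' % (sexprs_file, quoted)


if __name__ == "__main__":
    # prints: (main "test.sexprs" '("test1.c" "test2.c"))
    print(format_main_call("test.sexprs", ["test1.c", "test2.c"]))
```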

3. Filter small clones:
     (a) Make sure you have a copy of "exsize".

     (b) Determine the size of the largest clone.  Type:
	    sort -gr num-nodes-STATS | head -1
	 I call this <MAXSIZE> from here on.

     (c) Choose the threshold. I call this <MINSIZE> from here on. I
         typically choose <MINSIZE> to be 5.

     (d) Choose the name of the file that will hold the clones whose size
         is >= <MINSIZE>. I call this file <OUTPUT.MINSIZE+> from here on.

     (e) Run: exsize <MINSIZE> <MAXSIZE> > <OUTPUT.MINSIZE+>

      Notes:

        (a) exsize will not work unless OUTPUT and
            num-nodes-STATS are present.

        (b) exsize takes an optional third command-line argument,
            which can be any regular expression e; if e is given, then
            exsize outputs a clone pair from OUTPUT iff each clone in
            the pair is of size >= <MINSIZE> *and* the textual encoding
            of the pair (in OUTPUT) matches e.
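
      The selection rule in note (b) can be written down as a small
      predicate. This is only a sketch of the rule as stated, not the
      exsize script itself: the function name keep_pair and its
      parameters are hypothetical, and it assumes that "matches" means
      the regular expression occurs somewhere in the pair's line of text.

```python
import re


def keep_pair(size_a, size_b, encoding, minsize, pattern=None):
    """Return True if a clone pair passes the size (and optional regex) test.

    size_a, size_b: node counts of the two clones in the pair.
    encoding: the pair's textual encoding as it appears in OUTPUT.
    pattern: optional regular expression; None means no regex filter.
    """
    if size_a < minsize or size_b < minsize:
        return False
    # Assumption: a match anywhere in the encoding is enough.
    return pattern is None or re.search(pattern, encoding) is not None
```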

4. Remove clone pairs that are subsumed by or that are "worse" than
   other clone pairs. A clone pair <c1, c2> is said to be subsumed by
   a clone pair <c3, c4> if (a) the nodes in c1 are a subset of the
   nodes in c3 and the nodes in c2 are a subset of the nodes in c4, or
   (b) the nodes in c1 are a subset of the nodes in c4 and the nodes
   in c2 are a subset of the nodes in c3.
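
   As a sketch of the subsumption test just defined, assume each clone
   is represented as a set of node identifiers (the representation and
   the name is_subsumed are illustrative, not taken from the tool):

```python
def is_subsumed(pair, other):
    """Return True if clone pair `pair` is subsumed by clone pair `other`.

    Each pair is a 2-tuple of sets of node identifiers.
    """
    c1, c2 = pair
    c3, c4 = other
    straight = c1 <= c3 and c2 <= c4  # case (a)
    crossed = c1 <= c4 and c2 <= c3   # case (b)
    return straight or crossed
```

   Read literally, every pair subsumes itself under this definition, so
   a caller would apply the test to distinct pairs only.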

   The definition of when a clone pair is worse than
   another clone pair is provided in the comment at the
   beginning of the function RemoveWorseClonePairs_NoOvChk, in
   <clones-impl>/OLD/filter-pairs/filter_pairs.cc. Note that a clone pair
   <c1, c2> is worse than another clone pair <c3, c4> *only if* the
   two overlap by some percentage <COMMON>; i.e., either (a) COMMON%
   of the nodes in c1 are present in c3 and vice versa, and COMMON%
   of the nodes in c2 are present in c4 and vice versa, or (b) COMMON%
   of the nodes in c1 are present in c4 and vice versa, and COMMON%
   of the nodes in c2 are present in c3 and vice versa. You get to pick
   COMMON; I typically choose it to be 80.
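
   The overlap precondition can be sketched as follows. This covers
   only the overlap test, not the full "worse than" relation, which is
   defined in filter_pairs.cc. The sketch assumes that clones are sets
   of node identifiers and that "COMMON% of the nodes" means "at least
   COMMON%"; the function names are hypothetical.

```python
def mutual_overlap(a, b, common):
    """True if at least `common` percent of a's nodes are in b, and vice versa."""
    shared = len(a & b)
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return shared * 100 >= common * len(a) and shared * 100 >= common * len(b)


def overlap_enough(pair, other, common=80):
    """Necessary condition for `pair` to be considered worse than `other`."""
    c1, c2 = pair
    c3, c4 = other
    straight = mutual_overlap(c1, c3, common) and mutual_overlap(c2, c4, common)
    crossed = mutual_overlap(c1, c4, common) and mutual_overlap(c2, c3, common)
    return straight or crossed
```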

   Here is how this step proceeds:

     (a) Make sure you have copies of "filter-pairs.stk" and
         "libfilter_pairs.so".

     (b) Choose the name of the file that will contain the clones that
         are *not* removed in this step.  From here on I call this
         file <OUTPUT.MINSIZE+.filtered>.

         Choose the name of the file that will contain the sizes of the
         non-removed clones. From here on I call this file
         <num-nodes-STATS.MINSIZE+.filtered>. 
     
     (c) Run csurf (there is no need to say "csurf <program>").

     (d) At the "STk>" prompt type:
            (load "filter-pairs.stk")

     (e) Type:
            (main "<OUTPUT.MINSIZE+>" <COMMON> "<OUTPUT.MINSIZE+.filtered>"
                  "<num-nodes-STATS.MINSIZE+.filtered>")

     (f) When the "STk>" prompt returns, type Ctrl-D to quit
         CodeSurfer. <OUTPUT.MINSIZE+.filtered> and
         <num-nodes-STATS.MINSIZE+.filtered> will have been created,
         and will have the same number of lines.
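
   Since the two files produced by this step should have the same
   number of lines, a quick check can catch a run that was interrupted.
   A minimal sketch (the helper names are hypothetical):

```python
def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as handle:
        return sum(1 for _ in handle)


def same_line_count(path_a, path_b):
    """Return True if both files have the same number of lines."""
    return count_lines(path_a) == count_lines(path_b)
```

   For example, with <MINSIZE> equal to 5 one might call
   same_line_count("OUTPUT.5+.filtered", "num-nodes-STATS.5+.filtered").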

5. Separate the clone pairs in <OUTPUT.MINSIZE+.filtered> into
   different files, such that all clones having the same number of
   nodes end up in the same file. In other words, create one new
   output file for each size in the range [<MINSIZE>, <MAXSIZE>]:

     (a) Make sure you have a copy of "sep-sizes".

     (b) Type:
	 	sep-sizes <MAXSIZE>

        Note: Currently, the script sep-sizes assumes that
          <OUTPUT.MINSIZE+.filtered> is "OUTPUT.5+.filtered",
          and that <num-nodes-STATS.MINSIZE+.filtered> is
          "num-nodes-STATS.5+.filtered"; i.e., these file names are
          hard-coded into the sep-sizes script (however, the script
          *does not* assume that <MINSIZE> is equal to 5).

          Also, for each integer i in the range [<MINSIZE>,
          <MAXSIZE>], the output file produced is called
          "OUTPUT.filtered.i".

          The script can easily be modified to take these names
          as parameters, rather than as hard-coded values.

   The reason we separate the clones by size is that two clone pairs
   can be combined into a single group only if the clones in the two
   pairs are of equal size.
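
   To illustrate what a parameterized version of this step might look
   like, here is a sketch in Python. It is not the sep-sizes script. It
   assumes that line k of the sizes file begins with an integer giving
   the size of the clone pair on line k of the pairs file; that file
   format, the function name, and its parameters are all assumptions.

```python
def separate_by_size(pairs_path, sizes_path, minsize, maxsize,
                     out_prefix="OUTPUT.filtered"):
    """Split clone pairs into one file per size in [minsize, maxsize].

    A file named "<out_prefix>.<i>" is written for every size i in the
    range, even if no pair has that size.
    """
    buckets = {size: [] for size in range(minsize, maxsize + 1)}
    with open(pairs_path) as pairs, open(sizes_path) as sizes:
        for pair_line, size_line in zip(pairs, sizes):
            size = int(size_line.split()[0])
            if size in buckets:  # pairs outside the range are skipped
                buckets[size].append(pair_line)
    for size, lines in buckets.items():
        with open("%s.%d" % (out_prefix, size), "w") as out:
            out.writelines(lines)
```

   Writing an empty file for sizes with no pairs keeps one file per
   size in the range, as the step describes.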

6. In this final step we combine the pairs into groups:

    a. Make sure you have copies of group-pairs.stk, my-utils.stk, and
       libgroup_pairs.so. Start "csurf". Once again, there is no
       need to say "csurf <program>".

    b. At the "STk>" prompt type:

       (load "group-pairs.stk")
       (load "my-utils.stk")
       (group-script <MAXSIZE> <MINSIZE>)
       Ctrl-D

    Note:

      "group-script" assumes that for each i in the range
      [<MINSIZE>, <MAXSIZE>] the file "OUTPUT.filtered.i" is
      present. It produces two files for each such i:
      OUTPUT.filtered.i.groups and OUTPUT.filtered.i.group-sizes.
      OUTPUT.filtered.i.groups contains the clone groups, one group
      per line. OUTPUT.filtered.i.group-sizes has as many lines as
      there are groups in OUTPUT.filtered.i.groups; each line contains
      the number of clones in the corresponding group.
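
    This file does not describe how group-pairs.stk forms the groups.
    One plausible reading, shown here purely as an illustration and not
    as the tool's actual algorithm, is to merge pairs that share a
    clone, i.e., to treat each pair as an edge and take connected
    components. The function name and the clone representation are
    hypothetical.

```python
def group_pairs(pairs):
    """Merge clone pairs that share a clone into groups.

    pairs: iterable of (clone_a, clone_b), where each clone is any
    hashable identifier. Returns a list of sets, one per group.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for clone in parent:
        groups.setdefault(find(clone), set()).add(clone)
    return list(groups.values())
```

    Under this reading, the per-group counts written to the
    group-sizes file would correspond to len(g) for each returned
    group g.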

