Text and String Manipulation Problems in Java with FULL source code

Specification 

This assessed exercise is concerned with a number of text and string problems, each of which can be solved by an efficient algorithm using an application of the suffix tree data structure. The aim of the exercise is to illustrate the wide range of practical applications of this versatile data structure, and to provide experience of manipulating the suffix tree – this should aid understanding of the structure itself. The text and string problems that we will be concerned with are as follows:

  • searching for a given string in a piece of text; 

  • determining all occurrences of a given string in a piece of text; 

  • finding a longest repeated substring in a piece of text; 

  • finding a longest common substring of two pieces of text. 


You are given Java code for reading a specified text file into a string s and building the suffix tree for s. (The suffix tree construction algorithm works on the principle of repeated insertion of suffixes together with node-splitting; for the purposes of this assessed exercise, it may be helpful, but it is not essential, to understand how the construction algorithm works.) Your task is to extend the given code in order to implement solutions to each of Tasks 1-4 as indicated above, displaying the results for the user in an appropriate format (as suggested below).
appropriate format (as suggested below).

Setup 

Check AlgorithmicsII_16-17_setup_code attached. This contains the files Main.java, FileInput.java, text1.txt and text2.txt, and the subdirectory SuffixTreePackage. This subdirectory contains the files SuffixTree.java, SuffixTreeNode.java, SuffixTreeAppl.java, Task1Info.java, Task2Info.java, Task3Info.java and Task4Info.java.
To compile the program, issue the command javac Main.java from the top-level directory. To run the program, issue the command java Main from the same directory.

The given code 

Main.java is the main class, and the entry point for the program. It is intended that this class should implement a simple text-based menu (the user is prompted to enter a number (1-4) or ‘q’ to quit the program) for accessing the suffix tree applications that have been implemented as part of Tasks 1-4 listed in Section 3. The main method in this class should use FileInput.java in order to handle input from text files.
The package SuffixTreePackage contains seven classes, as follows. The class SuffixTreeNode is used in order to represent a node of a suffix tree; it includes a method to add a child to a given node and simple accessor and mutator methods. The class SuffixTree is used in order to represent a suffix tree; it includes methods for building a suffix tree and also various accessor and mutator methods. The class SuffixTreeAppl contains methods for accessing the functions relating to Tasks 1-4. The given method corresponding to Task N returns an object of class TaskNInfo (a class of the package SuffixTreePackage), containing the information to be displayed to the user.
Incomplete parts of the code
You will find that one constructor and a number of methods in Main.java, SuffixTree.java and SuffixTreeAppl.java are incomplete. Firstly, in Main.java,the main method is incomplete, as follows:


public static void main(String args[]) {
  Scanner standardInput = new Scanner(System.in);
  do {
    // display prompt for user
    System.out.println();
    System.out.print("Enter the number of the task or type 'q' to quit: ");
    / read in a line from standard input
    String line = standardInput.nextLine();
    System.out.println();
    try {
      // try to extract an integer from line if possible
      int numTask = Integer.parseInt(line);
      switch (numTask) {
      case 1: System.out.println("You entered '1'"); break;
      case 2: System.out.println("You entered '2'"); break;
      case 3: System.out.println("You entered '3'"); break;
      case 4: System.out.println("You entered '4'"); break;
      /* replace the above four lines with code to display relevant
   * output for each task
   *
   * in the case of Tasks 1, 2 and 3, get the name of a text file
   * from standard input; in the case of Task 4, get the names of
   * two text files from standard input
   * then, in all cases, read the data from the text file(s) using
   * the FileInput class and build the relevant suffix tree
   * in the case of Tasks 1 and 2, get a string from standard input
   * and convert the string to bytes, with the relevant information
   * stored in the array of bytes from positions 0 onwards
   * then call the relevant method from above to process the
   * information, and display the output using System.out.print
   * and System.out.println */
  default: throw new NumberFormatException();
} } 
catch (NumberFormatException e) {
  if (line.length()==0 || line.charAt(0)!='q')
    System.out.println("You must enter either
else break; 
    }
  } while (true);
  standardInput.close();
} 
'1', '2', '3', '4' or 'q'.");
Secondly, in SuffixTree.java, the second constructor is incomplete, as follows: 
/**
 * Builds a generalised suffix tree for two given strings.
 *
 * @param sInput1 the first string
 * @param sInput2 the second string
 * - assumes that '$' and '#' do not occur as a character
 *   anywhere in sInput1 or sInput2
 * - assumes that characters of sInput1 and sInput2 occupy
 *   positions 0 onwards
 */
public SuffixTree (byte[] sInput1, byte[] sInput2) {
  // to be completed!
}
Thirdly, in SuffixTreeAppl.java, the following four methods are incomplete: 
/**
 * Search the suffix tree t representing string s for a target x.
 * Stores -1 in Task1Info.pos if x is not a substring of s,
 * otherwise stores p in Task1Info.pos such that x occurs in s
 * starting at s[p] (p counts from 0)
 * - assumes that characters of s and x occupy positions 0 onwards
 *
 * @param x the target string to search for
 *
 * @return a Task1Info object
 */
public Task1Info searchSuffixTree(byte[] x) {
  return null; // replace with your code!
} 
/**
 * Search suffix tree t representing string s for all occurrences of target
 * x. Stores in Task2Info.positions a linked list of all such occurrences.
 * Each occurrence is specified by a starting position index in s
 * (as in searchSuffixTree above).  The linked list is empty if there
 * are no occurrences of x in s.
 * - assumes that characters of s and x occupy positions 0 onwards
 *
 * @param x the target string to search for
 *
 * @return a Task2Info object
 */
public Task2Info allOccurrences(byte[] x) {
  return null; // replace with your code!
}
/**
 * Traverse suffix tree t representing string s and stores ln, p1 and
 * p2 in Task3Info.len, Task3Info.pos1 and Task3Info.pos2 respectively,
 * so that s[p1..p1+ln-1] = s[p2..p2+ln-1], with ln maximal;
 * i.e., finds two embeddings of a longest repeated substring of s
 * - assumes that characters of s occupy positions 0 onwards
 * so that p1 and p2 count from 0
 *
 * @return a Task3Info object
 */
public Task3Info traverseForLrs () {
  return null; // replace with your code!
}
/**
 * Traverse generalised suffix tree t representing strings s1 (of length
 * s1Length), and s2, and store ln, p1 and p2 in Task4Info.len,
 * Task4Info.pos1 and Task4Info.pos2 respectively, so that
 * s1[p1..p1+ln-1] = s2[p2..p2+ln-1], with len maximal;
 * i.e., finds embeddings in s1 and s2 of a longest common substring
 * of s1 and s2
 * - assumes that characters of s1 and s2 occupy positions 0 onwards
 * so that p1 and p2 count from 1
 *
 * @param s1Length the length of s1
 *
 * @return a Task4Info object
 */
public Task4Info traverseForLcs (int s1Length) {
  return null; // replace with your code!
}

The Required Output

Your task is to complete the bodies of the above constructor and the above methods so that, when the user enters ‘1’, ‘2’, ‘3’ or ‘4’, together with further information (i.e., name(s) of text file(s) and a search string if appropriate), the output is as follows, respectively: 
  • the starting position of a given search string x in a string s read from a given text file if x occurs in s, or else a message stating that x does not occur in s (if x has multiple occurrences as a substring of s, then the starting position of just one of these occurrences, and not necessarily the first, should be returned); 
  • all occurrences (starting positions; not necessarily in sorted order) of a given search string x in a string s read from a given text file, together with the total number of such occurrences; 

  • a longest repeated substring x of a string s read from a given text file, the length of x and two distinct starting positions of x in s (note: your code should cope with the possibility that s has no repeated substring); 

  • a longest common substring x of strings s1 and s2 read from two text files, the length of x, a starting position of x in s1 and a starting position of x in s2 (note: your code should cope with the possibility that s1 and s2 have no common substring). 

Note that you should complete this exercise without altering any of the instance variables, methods or constructors in the supplied code apart from those explicitly shown in Section 6 above (with the exception that you may find it helpful to add instance variables and methods to SuffixTreeAppl.java, though it should not be necessary to add inner classes here). A small penalty may be applied if you do alter other parts of the supplied code. 

What to submit 

  • A document (written using e.g., Word or LaTeX) containing: A status report which, for any non-working implementation, should state clearly what happens on compilation (in the case of compile-time errors) or on execution (in the case of run-time errors).  A detailed implementation report justifying the operation of each of your methods corresponding to Tasks 1-4 above, making sure to explain how you manipulate the suffix tree data structure in each case. 
  • A pdf file containing a formatted listing of Main.java, SuffixTree.java and SuffixTreeAppl.java. These are the only classes that should be modified. However, if you did find it necessary to modify other classes, you should ensure that you include a listing of these classes, and highlight exactly what additional modifications have been made in your implementation report. In order to produce this pdf file, you should use the following Unix commands: a2ps –A fill –Ma4 fname1.java fname2.java (etc) –o code-listing.psps2pdf -sPAPERSIZE=a4 code-listing.ps
  • Your program code. Be sure to submit all of your classes, and not just the ones that you have altered. Also, ensure that you remove any debugging code that may generate large volumes of output on large input files. 

Get Project Solution Now

Comments