Introduction
The purpose of this article is to show you how to build a simple Seppia
application from scratch. There is no single .zip file to install and run,
only instructions to follow. We believe this gives you a better
opportunity to familiarize yourself with the Seppia technology.
These are the steps we will take:
The Application we need to write
The user requirements are set: "We need a software application able to check for spelling errors in a PDF
file (and of course we need it now)"
To any experienced developer, such requirements split naturally into two
separate tasks:
Open Source Projects
Note: many interesting alternative products for spell-checking and for PDF manipulation are available. Our choices were based purely on a few minutes' evaluation and should not be taken as a preference for, or recommendation of, one piece of software over another.
Building The Application
1. |
The first thing to do is to give our application a name. Let's call it GBShaw
in honour of the famous writer George Bernard Shaw (1856-1950). Create the directory
C:\gbshaw
and unzip Seppia.zip into it so that it looks like this:
|
If we are at the DOS prompt we should be able to type and see the following:
|
2. |
We now want to create the main module for this application.
Let's call it "org.gbshaw.core" and create a JavaScript file for it:
module/folder: org.gbshaw.core |
id/file: Main.js |
function main()
{
    java.lang.System.out.println(this +" running");
}
|
At the moment this script does not do much, but later on we will modify it to glue together the work of the other modules.
Let's also not forget to change "org.seppia.bootstrap.StartUp.js" so that it launches our new script.
module/folder: org.seppia.bootstrap |
id/file: StartUp.js |
//
// redirecting to org.gbshaw.core.Main.js
//
function main()
{
    run("org.gbshaw.core","Main");
}
|
So the directory structure should now look like this:
|
If we are still at the DOS prompt we should be able to type and see the following:
|
3. |
The next step is to create a module for the spell checker. As we said, we have decided to use jazzy (http://jazzy.sourceforge.net/).
At the time of writing, the version of jazzy is 0.5 and its development status is "3-alpha".
Now you need to do the following:
Finally, write the following JavaScript file.
module/folder: org.gbshaw.spell |
id/file: SpellChecker.js |
var File = java.io.File;
var HashSet = java.util.HashSet;
var SpellDictionary = Packages.com.swabunga.spell.engine.SpellDictionary;
var SpellDictionaryHashMap = Packages.com.swabunga.spell.engine.SpellDictionaryHashMap;
var SpellCheckEvent = Packages.com.swabunga.spell.event.SpellCheckEvent;
var SpellCheckListener = Packages.com.swabunga.spell.event.SpellCheckListener;
var SpellChecker = Packages.com.swabunga.spell.event.SpellChecker;
var StringWordTokenizer = Packages.com.swabunga.spell.event.StringWordTokenizer;
//
// Creates a JavaScript Object with one method to check the spelling on a text.
//
function main()
{
    var obj = new Object();
    obj.execute = execute;
    return obj;
}
//
// This function receives a String (text) and prints each misspelt word once.
//
function execute(text)
{
    // Locate the dictionary file relative to this module.
    var url = Packages.org.seppia.core.IO.subURL(module.url,"dictionary/english.0");
    var file = new File(url.getFile());
    var dictionary = new SpellDictionaryHashMap(file, null);
    var spellCheck = new SpellChecker(dictionary);
    var badWords = new HashSet();
    // Register the listener before checking, so that every error is reported.
    spellCheck.addSpellCheckListener(function spellingError(event)
    {
        var invalidWord = event.getInvalidWord();
        if (invalidWord.length()<2) return;         // ignore single characters
        if (badWords.contains(invalidWord)) return; // report each word only once
        badWords.add(invalidWord);
        java.lang.System.out.println("MISSPELT WORD: " + invalidWord);
    });
    spellCheck.checkSpelling(new StringWordTokenizer(text));
}
|
The above script gives us a mechanism to send a string (a text) as input and get the 'bad' words as output.
We can easily test its correctness by changing org.gbshaw.core.Main to use it.
module/folder: org.gbshaw.core |
id/file: Main.js |
function main()
{
    var spellChecker = run("org.gbshaw.spell","SpellChecker");
    spellChecker.execute("What a beautiful day to make a mistaaake");
}
|
If we are still at the DOS prompt we should be able to type and see the following:
|
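The listener in SpellChecker.js prints each misspelt word as it is found. If a caller needed the words themselves, the same filtering (skip single characters, report each word once) could be written as a plain function that returns an array. The sketch below is only an illustration: `collectBadWords` and the `isKnown` callback are hypothetical names standing in for the jazzy dictionary lookup, and are not part of jazzy or Seppia.

```javascript
// Hypothetical helper: returns the distinct unknown words, in order of first appearance.
// isKnown(word) is an assumed callback standing in for a real dictionary lookup.
function collectBadWords(words, isKnown)
{
    var seen = {};
    var badWords = [];
    for (var i = 0; i < words.length; i++)
    {
        var word = words[i];
        if (word.length < 2) continue;   // ignore single characters, as the listener does
        if (isKnown(word)) continue;     // correctly spelt, nothing to report
        var key = "w:" + word;           // prefix avoids clashes with built-in property names
        if (seen[key]) continue;         // already reported
        seen[key] = true;
        badWords.push(word);
    }
    return badWords;
}

// Example with a toy dictionary (an assumption made only for this illustration).
var known = { "what": true, "a": true, "beautiful": true, "day": true, "to": true, "make": true };
function isKnown(word) { return known[word.toLowerCase()] === true; }

var result = collectBadWords("What a beautiful day to make a mistaaake mistaaake".split(" "), isKnown);
```

Here `result` holds the single entry "mistaaake", even though the word appears twice in the input.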
4. |
We now want to create a module that extracts text from PDF files. Our choice fell on pdfbox
(http://www.pdfbox.org),
downloadable at http://sourceforge.net/project/showfiles.php?group_id=78314
At the time of writing, the available version is 0.6.7a. These are the
instructions:
Finally, create the following JavaScript file:
module/folder: org.gbshaw.pdf |
id/file: TextExtractor.js |
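// Java classes used by this script. The org.pdfbox and org.apache.log4j package
// names are assumed from the PDFBox release used here; adjust them if yours differ.
var Properties = java.util.Properties;
var FileInputStream = java.io.FileInputStream;
var StringWriter = java.io.StringWriter;
var PropertyConfigurator = Packages.org.apache.log4j.PropertyConfigurator;
var PDFParser = Packages.org.pdfbox.pdfparser.PDFParser;
var PDFTextStripper = Packages.org.pdfbox.util.PDFTextStripper;
// Silencing LOG4J (used internally by PDF Box)
var p = new Properties();
p.setProperty("log4j.rootCategory","OFF");
PropertyConfigurator.configure(p);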
//
// Creates a JavaScript Object with just one method to extract text from a pdf file.
//
function main()
{
    var obj = new Object();
    obj.execute = execute;
    return obj;
}
//
// This function receives a file (.PDF) in input and returns String (text)
// with the extracted text.
//
function execute(file)
{
    var inputStream = new FileInputStream(file);
    var parser = new PDFParser(inputStream);
    parser.parse();
    inputStream.close();
    var document = parser.getPDDocument();
    var stripper = new PDFTextStripper();
    var sw = new StringWriter();
    stripper.writeText(document, sw);
    document.close();
    return sw.getBuffer().toString();
}
|
To test this module we need to change org.gbshaw.core.Main once more:
module/folder: org.gbshaw.core |
id/file: Main.js |
var File = java.io.File;
function main()
{
    var textExtractor = run("org.gbshaw.pdf","TextExtractor");
    var s = textExtractor.execute(new File("c:\\eclipse-overview.pdf"));
    java.lang.System.out.println(s);
}
|
If we are still at the DOS prompt we should be able to type and see the following:
|
5. |
Now that we have written two useful modules, the last remaining thing to do is
to glue their functionality together. This can be achieved by modifying (once again) the
script org.gbshaw.core.Main
module/folder: org.gbshaw.core |
id/file: Main.js |
var System = java.lang.System;
var File = java.io.File;
function main()
{
    var f = new File("c:\\eclipse-overview.pdf");
    System.out.println("GB Shaw Started.");
    System.out.println("Analyzing File: "+f);
    var textExtractor = run("org.gbshaw.pdf","TextExtractor");
    var spellChecker = run("org.gbshaw.spell","SpellChecker");
    // Extract the text from the PDF, then hand it to the spell checker.
    var s = textExtractor.execute(f);
    spellChecker.execute(s);
}
|
If we are still at the DOS prompt we should be able to type and see the following:
|
Congratulations, you have finished your Seppia application. Of course there are still things to do (for example, we have hardcoded the name of the PDF file in the Main script), yet the application meets the requirements we wrote at the beginning of the article, and it should not be too difficult to go back, tidy it up and improve it.
Final Considerations
Now that your Seppia application is complete, you might wonder whether it would have been better to write it as a standard Java application. We think not, and we would like to explain why.
Because Seppia sets the rules for where to place jars, JavaScript files and other resources, your application benefits from an elegant and well-formed directory structure.
A valuable consequence is that there is no classpath to set, no -D properties to pass and no string array to hand to the main method.
Therefore, apart from providing additional material such as readme files, documents and licenses, shipping your application only requires zipping its root directory.
The requirements of your application will change. Rather than parsing just one
.pdf file you might need to work with many
.pdf files, perhaps selecting only files whose names begin with a
certain sequence of characters.
To address this requirement you simply need to modify the script
"Main" in the module "org.gbshaw.core". And that is it.
There is nothing else to do: no code to recompile, no new jars to ship and no classes to patch.
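As a rough illustration of the kind of change meant here, the selection of files by name could be a small helper inside Main.js. The sketch below is an assumption, not part of Seppia: `selectPdfNames` is a hypothetical function, and the list of names is hardcoded here, whereas in Main.js it could come from `new java.io.File(directory).list()`.

```javascript
// Hypothetical helper: keeps the names that end in ".pdf" (any case)
// and start with the given prefix.
function selectPdfNames(names, prefix)
{
    var selected = [];
    for (var i = 0; i < names.length; i++)
    {
        var name = String(names[i]);   // also accepts java.lang.String values under Rhino
        var lower = name.toLowerCase();
        var isPdf = lower.length >= 4 && lower.lastIndexOf(".pdf") === lower.length - 4;
        var hasPrefix = name.indexOf(prefix) === 0;
        if (isPdf && hasPrefix) selected.push(name);
    }
    return selected;
}

// Example file names, made up for this illustration.
var names = ["eclipse-overview.pdf", "eclipse-notes.txt", "readme.pdf", "eclipse-faq.PDF"];
var selected = selectPdfNames(names, "eclipse");
```

Here `selected` contains "eclipse-overview.pdf" and "eclipse-faq.PDF". Main.js could then loop over the selected names and pass each file to the text extractor and the spell checker in turn.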
Suppose GBShaw has become popular and we now need to parse .doc files. That is not too difficult. If we can find good third-party software able to do most of this work, all we need to do is glue it into our application: create a new module to work with .doc files, add the jar files in the jar directory and write a script similar to "org.gbshaw.pdf.TextExtractor" to do the work.
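One possible way to glue several extractors together is to let Main.js pick the module from the file extension. The sketch below is only a suggestion under stated assumptions: `extractorFor` and the `EXTRACTORS` table are hypothetical, and "org.gbshaw.doc" is a module name invented for illustration (only "org.gbshaw.pdf" is built in this article).

```javascript
// Maps a file extension to the module and script that can extract its text.
// "org.gbshaw.doc" is a hypothetical module that does not exist yet.
var EXTRACTORS = {
    "pdf": { module: "org.gbshaw.pdf", id: "TextExtractor" },
    "doc": { module: "org.gbshaw.doc", id: "TextExtractor" }
};

// Returns the table entry for the given file name, or null if the type is unsupported.
function extractorFor(fileName)
{
    var name = String(fileName);
    var dot = name.lastIndexOf(".");
    if (dot < 0) return null;                       // no extension at all
    var extension = name.substring(dot + 1).toLowerCase();
    return EXTRACTORS.hasOwnProperty(extension) ? EXTRACTORS[extension] : null;
}

var choice = extractorFor("report.DOC");
```

In Main.js the result could then be handed to Seppia's `run`, for example `run(choice.module, choice.id)`, after checking that `choice` is not null.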