Chapter 9. Collecting the knowledge – Jess in Action: Rule-Based Systems in Java

In this chapter you’ll...

  • Learn about knowledge engineering
  • Learn to interview experts
  • Collect requirements
  • Assemble domain knowledge

A journey of a thousand miles begins with the first step.

Lao Tzu

The first step in developing any rule-based system is collecting the knowledge the system will embody. This chapter shows how that is done. As a practical example, you’ll gather the knowledge you’ll build into your first nontrivial rule-based program.

9.1. The Tax Forms Advisor

For the next three chapters, you’ll be developing a simple rule-based application that recommends United States income tax forms. The application asks the user a series of questions and, based on the answers, tells the user which paper Internal Revenue Service forms she will likely need. You will populate the application with enough data to make it realistic, although you won’t try to make it exhaustive. Your application might be used in an information kiosk at the post office.

You’ll follow a realistic development process as you create this application, starting in this chapter by collecting the actual knowledge. In chapters 10 and 11 you’ll write the application using an iterative methodology, including lots of testing.

The Tax Forms Advisor has a command-line interface. You’ll concentrate on developing the rules themselves, so the entire program will be written in the Jess rule language without using any Java reflection capabilities. In the next part of this book, we’ll examine one way to add a graphical interface to applications like this one.

 

Disclaimer

The system you’re developing is intended only to provide guidelines about what tax forms a taxpayer might need to file. It is not intended to give authoritative legal advice about tax filing.

 

9.2. Introduction to knowledge engineering

Every rule-based system is concerned with some subset of all the world’s collected knowledge. This subset is called the domain of the system. The process of collecting information about a domain for use in a rule-based system is called knowledge engineering, and people who do this for a living are called knowledge engineers. On small projects, the programmers themselves might do all the knowledge engineering, whereas very large projects might include a team of dedicated knowledge engineers.

Professional knowledge engineers may have degrees in a range of disciplines: obvious ones like computer science or psychology, and domain-related ones like physics, chemistry, or mathematics. Obviously it helps if the knowledge engineer knows a lot about rule-based systems, although she doesn’t have to be a programmer.

A good knowledge engineer has to be a jack of all trades, because knowledge engineering is really just learning—the knowledge engineer must learn a lot about the domain in which the proposed system will operate. A knowledge engineer doesn’t need to become an expert, although that sometimes happens. But the knowledge engineer does have to learn something about the topic. In general, this information will include:

  • The requirements— Looking at the problem the system needs to solve is the first step. However, you might not fully understand the problem until later in the process.
  • The principles— You need to learn the organizing principles of the field.
  • The resources— Once you understand the principles, you need to know where to go to learn more.
  • The frontiers— Every domain has its dark corners and dead ends. You need to find out where the tough bits, ambiguities, and limits of human understanding lie.

The knowledge engineer can draw on many potential sources of information to research these points. Broadly, though, they fall into two categories: interviews and desk research. In the rest of this section, we’ll look at techniques for mining each of these information sources to gather the four categories of information we just listed.

9.2.1. Where do you start?

When you’re starting on a new knowledge engineering endeavor, it can be difficult to decide what to do first. Knowledge engineering is an iterative process. You usually can’t make a road map in advance; instead you feel your way along, adjusting your course as you go. As the saying goes, though, a journey of a thousand miles begins with a single step, and taking that first step can be hard.

With most projects, you should first talk to the customers—the people who are paying you to write the system. Find out what their needs are and what resources they can make available. This isn’t knowledge engineering per se, but requirements engineering—part of planning any software project. But the customer might point you to particular sources of technical information and help you plan your approach to knowledge engineering. After talking to the customers, you should have a rough idea of what the system should do and how long development is expected to take.

Next, it’s best to seek out general resources you can use to learn about the fundamentals of the domain and do a bit of self-study. Being at least vaguely familiar with the jargon and fundamental concepts in the domain will let you avoid wasting the time of people you interview later. You should learn enough about the fundamentals to have a rough idea of what kinds of knowledge the system needs to have.

Once you’ve developed an understanding of the basics, you’re ready to begin the iterative process. Based on your initial research, write down a list of questions about the domain which, if answered, would provide knowledge in the areas you previously identified. Seek out a cooperative subject-matter expert, briefly explain the project to him, and ask him the questions. (Often the customer will provide the expert; otherwise, the customer should pay the expert a consulting fee to work with you.) Usually the answers will lead to more questions.

After the initial interview, you can try to organize the information you’ve gathered into some kind of structure—perhaps a written outline or a flowchart. As you do this, you can begin to look for what might turn out to be individual rules. For the Tax Forms Advisor, an individual rule you might encounter early in the process would be (in the Jess language):

(defrule use-ez-form
    ; If filing status is "single", and...
    (filing-status single)
    ; user made less than $50000
    (income ?i&:(< ?i 50000))
    =>
    ; recommend the user file Form 1040EZ
    (recommend 1040EZ))

Detailed comments like those shown here will help non-technical people read and understand the rules, if necessary. Buy a stack of white index cards and write each potential rule on one side of an individual card. Use pencil so you can make changes easily. The cards are useful because they let you group the rules according to function, required inputs, or other criteria. When you have a stack of 100 cards or more, the utility becomes obvious. You can use the reverse sides of the cards to record issues regarding each rule. This stack of cards might be the final product of knowledge engineering, or the cards’ contents might be turned into a report. The cards themselves are often the most useful format, though.

After organizing the new knowledge on index cards, you may see obvious gaps that require additional information. Develop a new set of interview questions and meet with the expert again. The appropriate number of iterations depends on the complexity of the system.

Knowledge engineering doesn’t necessarily end when development begins. After an initial version of a system is available, the expert should try it out as a user and offer advice to correct its performance. If possible, a prototype of the system should be presented to the expert at every interview—except perhaps the first one.

Likewise, development needn’t be deferred until knowledge engineering is complete. For many small projects, the knowledge engineer is one of the developers, and in this case you may be able to dispense with the cards and simply encode the knowledge you collect directly into a prototype system. This is what you’ll do for the Tax Forms Advisor.

More on writing cards

To write down the rule use-ez-form on a card, I had to make up the deftemplate names filing-status and income and also define an imaginary function recommend. In general, you will write rules on these cards in pseudocode; they’re meant to suggest how the real rules might be coded, but they’re just guides. When actual development begins on the system, these early guesses will help the developers figure out what deftemplates and other infrastructure they need to define.

Finally, note that although I wrote use-ez-form in Jess syntax, it would be perfectly OK for a knowledge engineer to use natural language, or pseudocode that looks like some other programming language. If you are a knowledge engineer but not a programmer, writing rules in your native language may be the only option, and that’s fine.
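As a rough illustration of how a card like this might later become working code, here is a minimal sketch that makes use-ez-form runnable. The body of recommend is an assumption (the card only imagines such a function), the sample facts are invented for the demonstration, and the form name is written as a quoted string to sidestep any question about symbols that begin with a digit.

```jess
; Hypothetical stand-in for the card's imaginary "recommend" function:
; it simply prints the name of the suggested form.
(deffunction recommend (?form)
    (printout t "You may need Form " ?form crlf))

(defrule use-ez-form
    ; If filing status is "single", and...
    (filing-status single)
    ; user made less than $50000
    (income ?i&:(< ?i 50000))
    =>
    ; recommend the user file Form 1040EZ
    (recommend "1040EZ"))

; Invented sample data for one taxpayer.
(reset)
(assert (filing-status single))
(assert (income 32000))
(run)
```

With these facts the rule should fire once and print a single recommendation. If you change the income to 60000, the second pattern no longer matches and nothing should be recommended.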

9.2.2. Interviews

People are the best source of information about the requirements for a system. Many projects have requirements documents: written descriptions of how a proposed system should behave. Despite the best intentions, such documents rarely capture the expectations for a system in enough detail to allow the system to be implemented. Often, you can get the missing details only by talking to stakeholders: the customers and potential users of the system.

People can also direct you to books, web sites, and other people who will help you learn about the problem domain. These days it’s common to suffer from information overload when you try to research a topic—there are so many conflicting resources available that it’s hard to know what information to believe. The stakeholders in the system can tell you which resources they trust and which ones they don’t.

If you find conflicting information among otherwise trustworthy references during your research, or hear conflicting statements during interviews, don’t be afraid to ask for clarification. You’ll need a strategy for resolving conflicts that hinge on matters of opinion. Sometimes you can do this by picking a specific person as the ultimate arbiter. Other times, especially on larger projects, it’s appropriate to hold meetings to get the stakeholders to make decisions in a group setting.

Interviewing strategies

Cultivating a good relationship with the people you interview is important. This sounds so simple, you might not think it needs to be said—but it does. Computer people have a culture all their own, and it’s different enough from mainstream culture that programmers can be perceived as rude. If you’re a computer programmer working as a knowledge engineer, you may have to alter your accustomed behavior when you’re interviewing nonprogrammers. Here are a few things to watch out for:

  • Speak their language— It can be difficult for a programmer to remember that stacks, loops, shifts, and pointers are not part of the everyday vocabulary of most nonprogrammers. Don’t use programming terms if you can avoid it. You’ll also want to avoid geek words like grok, kludge, and lossage, which will only distance you from the interviewee. Instead, work hard to learn the technical jargon of the problem domain, and use it properly.
  • Show respect— No matter how trivial the domain may seem, the interviewee knows more about it than you, so don’t look down on people just because they don’t have the same education you do. Your knowledge of programming is not more important than their knowledge of inventory procedures. Your time is not more valuable than theirs. They’re doing you a favor by talking to you, so be grateful.
  • Be interested— Make eye contact when you talk to the interviewee. Ask follow-up questions to show that you’re listening. Take notes so you don’t ask the same question twice (unless, of course, you didn’t understand the answer the first time). Generally look as though you’re happy to be talking to the person—or they won’t talk to you again.
  • Dress for the occasion— Gone are the days when all white-collar workers wore white collars (and ties). But if you’re interviewing someone older than you, she might remember those days quite clearly. If you’re going to interview a client at a bank, don’t show up in sandals and a t-shirt. Dressing appropriately will help your interviewee relate to you.
  • Be reassuring— Often the interviewee is not the customer. A manager may be asking you to capture knowledge from an employee, and that employee may be afraid of being replaced by the proposed new system. Reassure the employee that he’s smarter than any computer, and explain that although the system may take over routine tasks, it will free the employee’s time to work on more important things. You don’t want anyone to perceive you, or the system you’re building, as an enemy.

Customers

The customers are the people who are paying you to build the system. Sometimes they know a lot about the problem, and other times they just want the problem solved. If the customer is also a domain expert, then your job is easy, because the customer can direct you to all the information you need. If the customer doesn’t know much about the problem domain, then the hardest part of your job may be identifying someone who does.

For the forms advisor application, the customer may be the postal service. No one at the post office will be able to supply much domain knowledge, but they will be able to describe the problem well enough. Luckily, it’s obvious in this case who the domain expert should be: a tax accountant. An accountant knows better than anyone else which tax forms people need under various circumstances. The customer should be willing to pay for some of an accountant’s time, or perhaps provide access to their own accountants.

Users

The users are the people who will interact with the system on a day-to-day basis. Like the customers, the users may or may not know much about the domain in which the system works. A particular category of user, the expert user, knows the domain very well. Expert users are people who will use your system to automate tasks they already know how to do. They are often the best kind of interviewee to work with, because they understand the problem and simultaneously know how they want the system to react.

The users for the forms advisor are not expert users; they are just people who wander into the post office to pick up tax forms. This kind of user isn’t particularly useful to interview for knowledge-engineering purposes. However, it can be useful to talk to naïve users about things like user-interface issues.

Experts

A domain expert is someone who has technical knowledge in the relevant problem area for your system. A good domain expert is worth her weight in gold, so it is important to seek one out and develop a good working relationship. Most of knowledge engineering consists of extracting information from domain experts.

For the forms advisor application, potential experts include accountants and Internal Revenue Service (IRS) workers. An accountant can tell you what forms are required most often by her clients, whereas an IRS employee may have statistics on form usage by the whole U.S. population. Both can help you understand the tax rules.[1]

1 Or maybe not. The U.S. General Accounting Office released a much-publicized report in 2001 relating its findings that IRS telephone personnel give out incorrect tax information 47% of the time.

9.2.3. Desk research

Not all of your information should come from people. When possible, you should instead collect basic or rote knowledge from written materials, so as not to waste other people’s time. Of course, you can’t believe everything you read—make sure the experts you talk to would trust the resources you use.

Books and journals

You might use two broad categories of written material: paper publications and electronic ones. With the explosion of the World Wide Web during the last decade, the amount of electronic research material available has mushroomed. Still, scholarly books and periodicals have a significant advantage over most electronic publications: They are usually peer reviewed. In the peer-review process, material destined for publication is read and critiqued by impartial experts. This process improves the accuracy and trustworthiness of the information.

In many scientific and engineering fields, college textbooks are an excellent way to get an overview of a domain. Introductory textbooks are often aimed at a general audience, so you can read them without a specialized background. The best textbooks have gone through several editions, honing their language and presentation. Monographs on specific topics can also be useful; these are used as texts for advanced college and graduate-level courses. Monographs are sometimes less well written than textbooks, and they are aimed at an audience with a specific technical background. University and technical libraries are a good source for textbooks and monographs.

Professional and scholarly journals are published several times each year, and they are an excellent way to keep up with advances in a particular field. They can be very expensive, so you’ll want to find them in a library as well.

Newsletters, circulars, and other publications aren’t usually peer-reviewed, but they can provide useful information. In particular, many government publications are an invaluable way to learn about laws, regulations, and practices; they combine and distill information from various laws, orders, legal decisions, and policies to produce practical guides.

Web sites and electronic media

You can often find hundreds or even thousands of references by typing a few key words describing your domain into an Internet search engine. There are online encyclopedias of every description, guides to technical fields, troves of engineering data, and countless other valuable resources.

Although the Internet is full of information, it is important to realize that not all of it is correct or unbiased. In particular, many search engines either accept payment for highly placed listings or use a ranking system that is easily fooled into placing a particular page at the top of your search results. Before using a general search engine, learn a little about how it is implemented and operated. Select one that, to the extent possible, ranks results only on their relevance to your search topic. You should also scrutinize individual web pages; check for the source of the information, and try to verify it against another reference.

Sometimes, published electronic reference works on CD-ROM are useful, although they are often simply expensive alternatives to (or worse, a repackaging of) material already available on the Web. Again, you can often find and use these references in libraries.

9.3. Collecting knowledge about tax forms

The domain for the example program is “distributing income tax forms.” The project sponsors might describe it like this:

The system should ask the user a series of questions and then recommend a list of income tax forms the user might need. The list doesn’t need to be exhaustive, but it should be generous—that is, if in doubt, recommend the form. The series of questions should be as short as possible and should never include irrelevant or redundant questions.

This simple statement is certainly enough to get you started on the knowledge engineering phase of the project.

9.3.1. An interview

If you’ve ever filed your own income taxes, you have a reasonable understanding of the concepts behind this application. So, you probably don’t need to do any advance desk work—in fact, you can probably gather all the necessary information from one or two interviews and from reading the forms themselves. The first step is to talk to an accountant and ask her to list the 10 most-used income tax forms. She gives you this list, in no particular order:

  • Form 1040— Income tax
  • Form 1040A— Income tax
  • Form 1040EZ— Income tax
  • Form 2441— Child and dependent care expenses
  • Form 2016EZ— Employee business expenses
  • Form 3903— Moving expenses
  • Form 4684— Casualties and thefts
  • Form 4868— Application for filing extension
  • Form 8283— Noncash charitable contributions
  • Form 8829— Home office expenses

With the list in hand, you can begin asking questions about individual forms. The most relevant question for each form is, “Who needs it?” The accountant’s answers are reproduced here:

  • Form 1040 is the standard long form. Everyone needs it.
  • Form 1040A is the short form. You can use it instead of Form 1040 if your taxable income is less than $50,000. You can’t itemize deductions if you use this form, but you can get a credit for child-care expenses.
  • Form 1040EZ is the really short form. You can use it instead of Form 1040A if you made less than $50,000, you have no dependents, and you don’t itemize deductions. If you’re married, you and your spouse must file a joint return or you can’t use this form.
  • Form 2441 lets you claim a credit for daycare expenses.
  • With Form 2016EZ, you can deduct the unreimbursed part of any expenses you incurred for your employer, primarily travel (except commuting). You can use this short form only if you weren’t reimbursed for any expenses; otherwise you have to use the long form.
  • Form 3903 gets you a deduction for unreimbursed moving expenses if you moved this year because of your job.
  • Form 4684 lets you recover some of your losses during the year—the part that was not covered by insurance. Many people use this form to deduct costs due to car accidents.
  • You fill out Form 4868 to get an extension for filing your taxes. Note that you still have to pay your taxes on time; you can pay an estimated amount with this form.
  • You need to file Form 8283 to get credit for donating more than $500 worth of property to charity.
  • You can file Form 8829 if you have a home office and you want to deduct expenses associated with that office. The rules are fairly restrictive, though—you have to be careful, or you will trigger an audit. You usually shouldn’t file this form unless you are self-employed or your home is your primary workplace.

The accountant’s expert knowledge is evident in a few of these answers, particularly the descriptions of Forms 2016EZ, 4684, and 8829. Also evident, however, are some of the common problems with interview data. The information is not very precise; for example, it is not true that “everyone needs” Form 1040, because there are two alternative forms.

After hearing these interview replies, a few follow-up questions suggest themselves immediately—for instance, you might want to confirm that the long version of Form 2016EZ is Form 2016 (it is). Otherwise, you should be able to get the rest of the information from the forms themselves; if you have questions about the forms, you can arrange another interview.
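The accountant's answers already suggest candidate rules. The following is one hypothetical way the cards for three of the forms might read. The fact names (pays-for-daycare, property-donations, wants-extension) and the recommend helper are invented placeholders, not names the finished application is known to use.

```jess
; Placeholder for the imaginary "recommend" function.
(deffunction recommend (?form)
    (printout t "You may need Form " ?form crlf))

; Card: Form 2441 -- credit for daycare expenses.
(defrule use-form-2441
    (pays-for-daycare yes)
    =>
    (recommend "2441"))

; Card: Form 8283 -- more than $500 of property donated to charity.
(defrule use-form-8283
    (property-donations ?d&:(> ?d 500))
    =>
    (recommend "8283"))

; Card: Form 4868 -- the user wants a filing extension.
(defrule use-form-4868
    (wants-extension yes)
    =>
    (recommend "4868"))

; Try the cards against one invented taxpayer.
(reset)
(assert (pays-for-daycare yes))
(assert (property-donations 750))
(assert (wants-extension no))
(run)
```

For this taxpayer the first two cards should apply and the third should not. Writing the cards this way also exposes the questions the system will eventually have to ask the user.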

9.3.2. Reviewing the forms

You can obtain copies of IRS forms from the IRS web site.[2] Studying the forms turns up a few potentially useful facts that the accountant didn’t specify:

2 The IRS web site is at http://www.irs.gov. Alternatively, you can get the forms directly at ftp://ftp.fedworld.gov/pub/irs-pdf/.

  • You can’t file Form 1040EZ if you earned more than $400 in taxable interest. People with more than a certain bank balance (depending on current interest rates) might not be able to use this form.
  • You can file Form 3903 only if you changed work locations and your new workplace is more than 50 miles farther from your old home than your old workplace was.
  • With Form 2441, you can get credit for care for an elderly parent or other dependent, not just care for children.
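Facts like these translate into extra conditions on the cards. As a sketch only, the 1040EZ card might be tightened and a Form 3903 card added as follows. The fact names, including the two mileage facts, are assumptions, and married couples filing jointly (who can also use Form 1040EZ according to the accountant) are left out for brevity.

```jess
; Placeholder for the imaginary "recommend" function.
(deffunction recommend (?form)
    (printout t "You may need Form " ?form crlf))

(defrule use-ez-form
    (filing-status single)
    (income ?i&:(< ?i 50000))
    (dependents 0)
    (itemizes-deductions no)
    ; From the form itself: no more than $400 in taxable interest.
    (taxable-interest ?t&:(<= ?t 400))
    =>
    (recommend "1040EZ"))

(defrule use-form-3903
    (moved-for-job yes)
    (miles-old-home-to-old-job ?old)
    ; The new workplace must be more than 50 miles farther away.
    (miles-old-home-to-new-job ?new&:(> (- ?new ?old) 50))
    =>
    (recommend "3903"))

; Invented sample data: both rules should apply to this taxpayer.
(reset)
(assert (filing-status single))
(assert (income 32000))
(assert (dependents 0))
(assert (itemizes-deductions no))
(assert (taxable-interest 120))
(assert (moved-for-job yes))
(assert (miles-old-home-to-old-job 10))
(assert (miles-old-home-to-new-job 75))
(run)
```

Raising taxable-interest above 400, or shrinking the mileage difference to 50 or less, should stop the corresponding rule from firing.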

9.3.3. Next steps

You’ve now amassed enough knowledge about the problem domain to write the application, which you will begin to do in the next chapter. You first need to organize the data by defining deftemplates and organize the rules by defining defmodules. You also need to write some infrastructure: functions for input and output, for example. In chapter 11, with the infrastructure in place, you will write the rules and deploy the application.
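To make the idea of organizing data with deftemplates and rules with defmodules a little more concrete, here is one possible sketch. The template names, slot names, and module name are guesses made for illustration; the real definitions are worked out in the next chapter and may well differ.

```jess
; Hypothetical templates, defined in the default MAIN module.
(deftemplate user
    (slot income (default 0))
    (slot dependents (default 0)))

(deftemplate recommendation
    (slot form)
    (slot explanation))

; A hypothetical module that groups the recommendation rules.
; Constructs defined after this point belong to the new module.
(defmodule recommend)

(defrule use-ez-form
    (user (income ?i&:(< ?i 50000)) (dependents 0))
    =>
    (assert (recommendation
                (form "1040EZ")
                (explanation "Income below $50,000 and no dependents"))))

; Invented sample data; rules in a module fire only while it has focus.
(reset)
(assert (user (income 32000)))
(focus recommend)
(run)
(facts *)
```

Recording each recommendation as a fact, rather than printing it immediately, is one design choice among several. It would let a later stage of the program collect and report all the recommendations together.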

9.4. Summary

The application area for a rule-based system is called its problem domain. The process of collecting information about a problem domain is called knowledge engineering. Knowledge engineering can include gathering data from interviews, books and other publications, the Internet, and other sources.

You’ve begun work on an application that advises people about the forms they need to use to file their United States federal income taxes. In this chapter, you did the preliminary knowledge engineering, and the end result is several lists of information chunks in prose form.