22.6 Keyword Search – Introduction to Cryptography with Coding Theory, 3rd Edition

22.6 Keyword Search

Alice runs a large corporation. Various employees send in reports containing secret information. These reports need to be sorted and routed to various departments, depending on the subject of the message. Of course, each report could have a set of keywords attached, and the keywords could determine which departments receive the reports. But maybe the keywords are sensitive information, too. Therefore, the keywords need to be encrypted. But if the person who sorts the reports decrypts the keywords, then this person sees the sensitive keywords. It would be good to have the keywords encrypted in a way that an authorized person can search for a certain keyword in a message and determine either that the keyword is present or not, and receive no other information about other keywords. A solution to this problem was found by Boneh, Di Crescenzo, Ostrovsky, and Persiano.

Start with a supersingular elliptic curve E and point P0 as in Section 22.1. We need two public hash functions:

  1. H1 maps arbitrary length binary strings to multiples of P0. A little care is needed in defining H1,  since no one should be able, given a binary string b,  to find k with H1(b)=kP0. See Exercise 7.

  2. H2 maps the qth roots of unity (in the finite field with p2 elements) to binary strings of fixed length; for example, if the roots of unity are expressed in binary, H2 can be a standard hash function.

Alice sets up the system as follows. Incoming reports (encrypted) will have attachments consisting of encrypted keywords. Each department will have a set of keyword searches that it is allowed to do. If it finds one of its allotted keywords, then it saves the report.

  1. Alice chooses a secret integer a and computes P1=aP0.

  2. Alice computes


    The point Tw is sent via secure channels to each department that is authorized to search for w.

When Daphne writes a report, she attaches the relevant encrypted keywords to the documents. These encrypted keywords are produced as follows:

  1. Let w be a keyword. Daphne chooses a random integer r0(mod q) and computes

    t=e˜(H1(w), rP1).
  2. The encryption of the keyword w is the pair

    [rP0, H2(t)].

A searcher looks at the encrypted keywords [A, b] attached to a document and checks

H2(e˜(Tw, A))=?b.

If yes, then the searcher concludes that w is a keyword for that document. If no, then w is not a keyword for it.

Why does this work? If [A, b] is the encrypted form of the keyword w,  then

A=rP0 and t=e˜(H1(w), rP1)

for some r. Therefore,

e˜(Tw, A)=e˜(aH1(w), rP0)=e˜(H1(w), rP0)a=e˜(H1(w), rP1)=t, 

so H2(e˜(Tw, A))=H2(t)=b.

Suppose conversely, that w is another keyword. Is it possible that [A, b] corresponds to both w and w? Since the same value of r could be used in the encryptions of w and w,  it is possible that A occurs in encryptions of both w and w. However, we’ll show that in this case the number b comes from at most one of w and w. Since H1 is collision resistant and Tw=aH1(w) and Tw=aH1(w),  we expect that TwTw and e˜(Tw, A)e˜(Tw, A). (See Exercise 1.) Since H2 is collision resistant, we expect that

H2(e˜(Tw, A))H2(e˜(Tw, A)).

Therefore, [A, b] passes the verification equation for at most one keyword w.

Each time that Daphne encrypts the keyword, a different r should be used. Otherwise, the encryption of the keyword will be the same and this can lead to information being leaked. For example, someone could notice that certain keywords occur frequently and make some guesses as to their meanings.

There are many potential applications of this keyword search scheme. For example, suppose a medical researcher wants to find out how many patients at a certain hospital were treated for disease X in the previous year. For privacy reasons, the administration does not want the researcher to obtain any other information about the patients, for example, gender, race, age, and other diseases. The administration could give the researcher TX. Then the researcher could search the encrypted medical records for keyword X without obtaining any information other than the presence or absence of X.