# 22.6 Keyword Search

Alice runs a large corporation. Various employees send in reports containing secret information. These reports need to be sorted and routed to various departments, depending on the subject of the message. Of course, each report could have a set of keywords attached, and the keywords could determine which departments receive the reports. But maybe the keywords are sensitive information, too. Therefore, the keywords need to be encrypted. But if the person who sorts the reports decrypts the keywords, then this person sees the sensitive keywords. It would be good to have the keywords encrypted in a way that an authorized person can search for a certain keyword in a message and determine either that the keyword is present or not, and receive no other information about other keywords. A solution to this problem was found by Boneh, Di Crescenzo, Ostrovsky, and Persiano.

Start with a supersingular elliptic curve $E$ and point ${P}_{0}$ as in Section 22.1. We need two public hash functions:

${H}_{1}$ maps arbitrary length binary strings to multiples of ${P}_{0}\text{.}$ A little care is needed in defining ${H}_{1}\text{,}\text{}$ since no one should be able, given a binary string $b\text{,}\text{}$ to find $k$ with ${H}_{1}(b)=k{P}_{0}\text{.}$ See Exercise 7.

${H}_{2}$ maps the $q$th roots of unity (in the finite field with ${p}^{2}$ elements) to binary strings of fixed length; for example, if the roots of unity are expressed in binary, ${H}_{2}$ can be a standard hash function.

Alice sets up the system as follows. Incoming reports (encrypted) will have attachments consisting of encrypted keywords. Each department will have a set of keyword searches that it is allowed to do. If it finds one of its allotted keywords, then it saves the report.

Alice chooses a secret integer $a$ and computes ${P}_{1}=a{P}_{0}\text{.}$

Alice computes

$${T}_{w}=a\text{\hspace{0.17em}}{H}_{1}(w)\text{.}$$The point ${T}_{w}$ is sent via secure channels to each department that is authorized to search for $w\text{.}$

When Daphne writes a report, she attaches the relevant encrypted keywords to the documents. These encrypted keywords are produced as follows:

Let $w$ be a keyword. Daphne chooses a random integer $r\overline{)\equiv}0\text{\hspace{0.17em}}\text{\hspace{0.17em}}(\text{mod}\text{\hspace{0.17em}}q)$ and computes

$$t=\tilde{e}({H}_{1}(w)\text{,}\text{}r{P}_{1})\text{.}$$The encryption of the keyword $w$ is the pair

$$[r{P}_{0}\text{,}\text{}{H}_{2}(t)]\text{.}$$

A searcher looks at the encrypted keywords $[A\text{,}\text{}b]$ attached to a document and checks

If yes, then the searcher concludes that $w$ is a keyword for that document. If no, then $w$ is not a keyword for it.

Why does this work? If $[A\text{,}\text{}b]$ is the encrypted form of the keyword $w\text{,}\text{}$ then

for some $r\text{.}$ Therefore,

so ${H}_{2}(\tilde{e}({T}_{w}\text{,}\text{}A))={H}_{2}(t)=b\text{.}$

Suppose conversely, that ${w}^{\prime}$ is another keyword. Is it possible that $[A\text{,}\text{}b]$ corresponds to both $w$ and ${w}^{\prime}$? Since the same value of $r$ could be used in the encryptions of $w$ and ${w}^{\prime}\text{,}\text{}$ it is possible that $A$ occurs in encryptions of both $w$ and ${w}^{\prime}\text{.}$ However, we’ll show that in this case the number $b$ comes from at most one of $w$ and ${w}^{\prime}\text{.}$ Since ${H}_{1}$ is collision resistant and ${T}_{{w}^{\prime}}=a\text{\hspace{0.17em}}{H}_{1}({w}^{\prime})$ and ${T}_{w}=a\text{\hspace{0.17em}}{H}_{1}(w)\text{,}\text{}$ we expect that ${T}_{{w}^{\prime}}\ne {T}_{w}$ and $\tilde{e}({T}_{{w}^{\prime}}\text{,}\text{}A)\ne \tilde{e}({T}_{w}\text{,}\text{}A)\text{.}$ (See Exercise 1.) Since ${H}_{2}$ is collision resistant, we expect that

Therefore, $[A\text{,}\text{}b]$ passes the verification equation for at most one keyword $w\text{.}$

Each time that Daphne encrypts the keyword, a different $r$ should be used. Otherwise, the encryption of the keyword will be the same and this can lead to information being leaked. For example, someone could notice that certain keywords occur frequently and make some guesses as to their meanings.

There are many potential applications of this keyword search scheme. For example, suppose a medical researcher wants to find out how many patients at a certain hospital were treated for disease $X$ in the previous year. For privacy reasons, the administration does not want the researcher to obtain any other information about the patients, for example, gender, race, age, and other diseases. The administration could give the researcher ${T}_{X}\text{.}$ Then the researcher could search the encrypted medical records for keyword $X$ without obtaining any information other than the presence or absence of $X\text{.}$