BRICS
Basic Research in Computer Science
Lossy Dictionaries
Rasmus Pagh
Flemming Friche Rodler
BRICS Report Series RS-01-33
ISSN 0909-0878 August 2001
Copyright © 2001, Rasmus Pagh & Flemming Friche Rodler.
BRICS, Department of Computer Science, University of Aarhus. All rights reserved.
Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.
See back inner page for a list of recent BRICS Report Series publications.
Copies may be obtained by contacting:
BRICS
Department of Computer Science, University of Aarhus
Ny Munkegade, building 540, DK-8000 Aarhus C
Denmark
Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk
BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:
http://www.brics.dk ftp://ftp.brics.dk
This document in subdirectory RS/01/33/
Rasmus Pagh∗ and Flemming Friche Rodler†
BRICS
Department of Computer Science
University of Aarhus, Denmark
{pagh,ffr}@brics.dk
August 2001
Abstract
Bloom filtering is an important technique for space efficient storage of a conservative approximation of a set S. The set stored may have up to some specified number of false positive members, but all elements of S are included. In this paper we consider lossy dictionaries that are also allowed to have false negatives, i.e., leave out elements of S. The aim is to maximize the weight of included keys within a given space constraint. This relaxation allows a very fast and simple data structure making almost optimal use of memory. Being more time efficient than Bloom filters, we believe our data structure to be well suited for replacing Bloom filters in some applications. Also, the fact that our data structure supports information associated with keys paves the way for new uses, as illustrated by an application in lossy image compression.
1 Introduction
Dictionaries are part of many algorithms and data structures. A dictionary provides access to information indexed by a set S of keys: Given a key, it returns the associated information or reports that the key is not in the set. In this paper we will not be concerned with updates, i.e., we consider the static dictionary problem. The main parameters of interest are of course the space used by the dictionary and the time for looking up information. We will assume keys as well as the information associated with keys to have a fixed size.

A large literature has grown around the problem of constructing efficient dictionaries, and theoretically satisfying solutions have been found. Often a slightly easier problem has been considered, namely the membership problem, which is the dictionary problem without associated information. It is usually easy to derive a dictionary from a solution to the membership problem, using [...]
∗ Partially supported by the IST Programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
† Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation.
We are particularly interested in dictionary and membership schemes using little memory. Let n denote the size of the key set S. It has been shown that when keys are w-bit machine words, lookups can be performed in constant time in a membership data structure occupying B + o(B) bits of memory, where B = ⌈log C(2^w, n)⌉ is the minimum amount of memory needed to be able to represent any subset of size n [2] (logarithms in this paper are base 2, and C(a, b) denotes the binomial coefficient). However, constant factors in the lower order term and lookup time make this and similar schemes less than one could hope for from an applied point of view. Also, difficulty of implementation is an obstacle to practical use. In total, current schemes with asymptotically optimal space usage appear to be mainly of theoretical interest.
If one relaxes the requirements to the membership data structure, allowing it to store a slightly different key set than intended, new possibilities arise. A technique finding many applications in practice is Bloom filtering [1]. This technique allows space-efficient storage of a superset S′ of the key set S, such that S′ \ S is no more than an ε fraction of {0,1}^w. For n ≤ ε2^w, about log(1/ε) bits per key in S are necessary and sufficient for this [4]. This is a significant savings compared to a membership data structure using B ≈ n log(2^w e/n) bits. Lookup of a key using Bloom filtering requires O(log(1/ε)) memory accesses, and is thus relatively slow compared to other hashing schemes when ε is small. Also, Bloom filtering differs from most other hashing techniques in that it does not yield a solution to the dictionary problem.
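As background, the bit-vector scheme just described can be sketched as follows. This is our generic illustration, not code from the paper: the table size, the number of hash functions, and the SHA-256-based probe derivation are all illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: stores a superset S' of the inserted keys.
    Lookups may err with false positives, never with false negatives."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _probes(self, key):
        # Derive k bit positions from a hash of (probe index, key).
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, key)).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # Lookup is the conjunction of k bit probes, as in Bloom's scheme.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._probes(key))

f = BloomFilter(m_bits=4096, k_hashes=7)
for key in [b"alpha", b"beta", b"gamma"]:
    f.add(key)
```

With k hash functions a lookup costs k bit probes, which is the O(log(1/ε)) access cost mentioned above when k ≈ log(1/ε).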
1.1 This Paper
In this paper we introduce the concept of lossy dictionaries that can have not only false positives (like Bloom filters), but also false negatives. That is, some keys in S (with associated information) are thrown away when constructing the dictionary. For false positives there is no guarantee on the associated information returned. We let each key in S have a weight, and try to maximize the sum of weights of keys in the dictionary under a given space constraint.

We study this problem on a unit cost RAM, in the case where keys are machine words of w bits, examining a very simple and efficient data structure from a theoretical as well as an experimental point of view. Experimentally, we find that our data structure has surprisingly good behavior with respect to keeping the keys of largest weight. The experimental results are partially explained by our theoretical considerations, under strong assumptions on the hash functions involved. Specifically, we assume that in our RAM model, for a number of random functions, arbitrary function values can be returned in constant time by an oracle. We also show that our data structure is nearly optimal with respect to space usage.
1.2 Applications
A cache can be seen as a dictionary that stores a small subset of a large key set, plus associated information. It is thus inherently lossy. A lossy dictionary allowed to discard a small fraction of a key set may thus in many cases be a quite [...] allow no false positives. Our lossy dictionary seems best suited for applications where the cache only changes periodically, as for example in Web caching.

Web cache sharing [7] is a technique for implementing cooperating caches, for example Web proxies. When a request arrives at a proxy, it first checks whether it can answer the request. If not, it can forward the request to other proxies in the network. However, this increases traffic and is rather expensive. In cooperative caching each proxy keeps a summary of the content of all relevant proxies available to it. To reduce space requirements, this summary is stored with a small fraction of error using Bloom filtering. Often this reduces network traffic dramatically, since there is no more than a small chance that an expensive request forwarding is performed in vain. Lossy dictionaries with two-sided error could be used as a summary rather than a Bloom filter, since a small fraction of false negatives (cache misses) is tolerable.

In fact, the general idea of using in-memory summaries to reduce the number of expensive operations, such as I/Os, is well known in the database community. It dates at least back to [18], which uses Bloom filtering for efficient management of different versions of databases.

Recently, interest in lossy (volume) data compression with fast random access to decoded data has arisen [9, 11, 16, 17]. In [17] we show that lossy dictionaries are well suited for this purpose, providing lossy storage of the coefficients of wavelet transformed data. Compared to the previously best methods in the literature [9, 16], our lossy dictionary based scheme typically performs 80% better in terms of compression ratio, while significantly reducing the random access time.
1.3 Related Work
Most previous work on static dictionaries has considered the membership problem on a unit cost RAM with word size w. The first membership data structure with worst case constant lookup time using O(n) words of space was constructed by Fredman et al. [8]. For constant δ > 0, the space usage is O(B) when 2^w > n^{1+δ}, but in general the data structure may use Ω(Bw) bits of space. The space usage has been lowered to B + o(B) bits by Brodnik and Munro [2]. The lower order term was subsequently improved to o(n) + O(log w) bits by the first author [13]. The main concept used in the latter paper is that of a quotient function q of a hash function h, defined to be a function such that the mapping k ↦ (h(k), q(k)) is injective.

The membership problem with false positives was first considered by Bloom [1]. He described a technique, now known as Bloom filtering, where lookups return the conjunction of a number of bits from a bit vector. The locations of the bit probes are the values of a series of hash functions on the element to be looked up. Apart from Bloom filtering, the paper presents a less space efficient data structure that is readily turned into a lossy dictionary with only false positives. However, the space usage of the derived lossy dictionary is not optimal. Carter et al. [4] provided a lower bound of n log(1/ε) bits on the space needed to solve membership with an ε fraction of false positives, for n ≤ ε2^w, and gave data structures essentially matching this bound. Though none of their membership data structures have constant lookup time, such a data structure follows by plugging the abovementioned results on space optimal membership data structures [2, 13] into a general reduction provided in [4]. In fact, the dictionary of [13] can be easily modified to a lossy dictionary with false positives, thus also supporting associated information, using O(n + log w) bits more than the lower bound.

Another relaxation of the membership problem was recently considered by Buhrman et al. [3]. They store the set S exactly, but allow the lookup procedure to use randomization and to have some probability of error. For two-sided error ε they show that there exists a data structure of O(nw/ε²) bits in which lookups can be done using just one bit probe. It was shown that to do the same without false negatives, O(n²w/ε²) bits suffice, and that this is essentially optimal. Schemes using more bit probes and less space were also investigated. If one fixes the random bits of the lookup procedure appropriately, the result is a lossy dictionary with error ε. However, it is not clear how to efficiently guarantee the ε fraction of false positives in a reasonable model of computation, so this does not immediately give rise to a lossy dictionary.
2 Lossy Dictionaries
Consider a set S containing keys k_1, ..., k_n with associated information a_1, ..., a_n and positive weights v_1, ..., v_n. Suppose we are given an upper bound m on available space and an error parameter ε ≥ 0. The lossy dictionary problem for ε = 0 is to store a subset of the keys in S and corresponding associated information in a data structure of m bits, trying to optimize the sum of weights of included keys. For general ε we also allow the dictionary to contain up to ε2^w keys from the complement of S. In this section we show the following theorem.

Theorem 1. Let a sequence of keys k_1, ..., k_n ∈ {0,1}^w, associated information a_1, ..., a_n ∈ {0,1}^l, and weights v_1 ≥ ... ≥ v_n > 0 be given. Let r > 0 be an even integer, and b ≥ 0 an integer. Suppose we have oracle access to random functions h_1, h_2: {0,1}^w → {1, ..., r/2} and corresponding quotient functions q_1, q_2: {0,1}^w → {0,1}^s \ {0^s}. There is a lossy dictionary with the following properties:

1. The space usage is r(s − b + l) bits (two tables with r/2 cells of s − b + l bits).

2. The fraction ε of false positives is bounded by ε ≤ (2^b − 1) r/2^w.

3. The expected weight of the keys in the set stored is Σ_{i=1}^n p_{r,i} v_i, where

   p_{r,i} ≥ 1 − (5/2) r^{-1}/(r/i − 2),   for i < r/2,
   p_{r,i} ≥ 2(1 − 2/r)^{i−1} − (1 − 2/r)^{2(i−1)},   for i ≥ r/2,

   is the probability that k_i is included in the set (which is independent of v_i).

5. The construction time is O(n log* n + rl/w).
As discussed in Section 2.1 there exist quotient functions for s = w − log(r/2) + O(1) if the hash functions map approximately the same number of elements to each value in {1, ..., r/2}. The inequality in item 2 is satisfied for b = ⌊log(ε2^w/r + 1)⌋, so for s = w − log r + O(1) an ε fraction of false positives can be achieved using space r(log(1/(ε + r/2^w)) + l + O(1)). As can be seen from item 3, almost all of the keys {k_1, ..., k_{r/2}} are expected to be included in the set represented by the lossy dictionary. For i ≥ r/2 our bound on p_{r,i} is shown in Figure 5 of Section 3, together with experimentally observed probabilities. If n ≥ r and if r is large enough, it can be shown by integration that, in the expected sense, more than 70% of the keys from {k_1, ..., k_r} are included in the set (our experiments indicate 84%). We show in Section 2.5 that the amount of space that we use to achieve this is within a small constant factor of optimal.

Note that by setting b = 0 we obtain a lossy dictionary with no false positives. Another point is that given a desired maximum space usage m and false positive fraction ε, the largest possible size r of the tables can usually be chosen efficiently. Assume, for example, that we have quotient functions with range of ⌈log(2^{w+1}/r)⌉ bits and consider the case b = 0. The memory usage is r(⌈log(2^{w+1}/r)⌉ + l). Whenever r is doubled, the number of bits per cell becomes one less. This means that the memory usage increases piecewise linearly in r, with jumps when r is a power of two. By setting r = 2^i for i = 1, 2, 3, ... we find the i for which m is first exceeded. The correct value of r can now easily be found in the interval 2^{i−1} < r < 2^i. For general b this becomes more complicated, as we need to investigate more intervals, but finding r is still implementable in O(log m) time.

2.1 Preliminaries
The starting point for the design of our data structure is a static dictionary recently described in [14]. In this dictionary, two hash tables T_1 and T_2 are used together with two hash functions h_1, h_2: {0,1}^w → {1, ..., r/2}, where r denotes the combined size of the hash tables, assumed to be even. A key x ∈ S is stored in either cell h_1(x) of T_1 or cell h_2(x) of T_2. It was shown that if r ≥ (2 + δ)n, for constant δ > 0, and h_1, h_2 are random functions, there exists a way of arranging the keys in the tables according to the hash functions with probability at least 1 − 5/(2δr). For small δ this gives a dictionary utilizing about 50% of the hash table cells. The arrangement of keys was shown to be computable in expected linear time.

Another central concept is that of quotient functions. Recall that a quotient function q of a hash function h is a function such that the mapping k ↦ (h(k), q(k)) is injective [13]. When storing a key k in cell h(k) of a hash table, it is sufficient to store q(k) to uniquely identify k among all other elements hashing to h(k). To mark empty cells, one needs a bit string not mapped to by the quotient function, e.g. 0^s for the quotient functions of Theorem 1.

Figure 1: Example of our data structure. (T_1 holds q_1(k_1), q_1(k_2), q_1(k_4) in cells h_1(k_1), h_1(k_2), h_1(k_4); T_2 holds q_2(k_3), q_2(k_5) in cells h_2(k_3), h_2(k_5).)

The idea of using quotient functions is that storing q(k) may require fewer bits than storing k itself. If a fraction O(1/r) of all possible keys hashes to each of r hash table cells, there is a quotient function whose function values can be stored in w − log r + O(1) bits. This approach was used in [13] to construct a dictionary using space close to the information theoretical minimum.
Example. We consider the hash function family from [6] mapping from {0,1}^w to {0,1}^t, i.e., with r = 2^t. It contains functions of the form h_a(k) = (ak mod 2^w) div 2^{w−t} for a odd and 0 < a < 2^w. A corresponding family of quotient functions is given by q_a(k) = (ak mod 2^w) mod 2^{w−t}, whose function values can be stored in w − log r bits.

2.2 Our Data Structure
The idea behind our lossy dictionary, compared to the static dictionary of [14] described above, is to try to fill the hash tables almost completely, working with key sets of size similar to or larger than r. Each key has two hash table cells to which it can be matched.

Thus, given a pair of hash functions, the problem of finding a maximum weight subset of S that can be arranged into the hash tables is a maximum weight matching problem that can be solved in polynomial time, see e.g. [5]. In Section 2.3 we will present an algorithm that finds such an optimal solution in time O(n log* n). The term O(rl/w) in the time bound of Theorem 1 is the time needed to copy associated information to the tables. Assume for now that we know which keys are to be represented in which hash table cells.

For b = 0 we simply store quotient function values in nonempty hash table cells and the value 0^s in empty hash table cells, using s bits per cell, as shown in Figure 1. For general b we store only the first s − b bits. Observe that no more than 2^b keys with the same hash function value can share the first s − b bits of the quotient function value. This means that there are at most 2^b − 1 false positives for each nonempty cell. Since 0^s is not in the range, this is also true for empty cells. In addition to the s − b bits, we use l bits per cell to store associated information.
We now proceed to fill in the remaining details on items 3 and 5 of Theorem 1.

2.3 Construction Algorithm

Recall that the task of constructing our data structure boils down to finding the largest weight arrangement of keys in the tables. Given hash functions h_1 and h_2 and a key set K, we define the bipartite graph G(K) with vertex set {1,2} × {1, ..., r/2}, corresponding in a natural way to hash table cells, and the multiset of edges {{(1, h_1(k)), (2, h_2(k))} | k ∈ K}, corresponding to keys. Note that there may be parallel edges if several keys have the same pair of hash function values. We will use the terms keys/edges and cells/vertices synonymously. A connected component of G(K) is defined to be saturated if the number of edges is greater than or equal to the number of vertices, i.e., if it is not a tree. We have the following characterization of the key sets that can be placed in the tables according to the given hash functions.
Lemma 1. The key set K can be placed in the tables if and only if each connected component of G(K) is a tree, plus possibly an extra edge.

Proof. By Hall's theorem, K can be placed in the tables if and only if every subset K′ ⊆ K satisfies |h_1(K′)| + |h_2(K′)| ≥ |K′|. This is true if and only if every subset K′ of edges in G(K) covers at least |K′| vertices. Since it is equivalent to quantify only over subsets of edges within a connected component, the lemma follows. □
By an optimal solution for a key set K we will understand a maximum weight subset of K that can be placed in the tables.

Lemma 2. There is an optimal solution for {k_1, ..., k_i} including key k_i if and only if for any optimal solution K′ for {k_1, ..., k_{i−1}}, the set K′ ∪ {k_i} can be placed in the tables.

Proof. If K′ ∪ {k_i} can be placed in the tables for some solution K′ optimal for {k_1, ..., k_{i−1}}, then K′ ∪ {k_i} must be optimal for {k_1, ..., k_i}.

On the other hand, suppose that for some K ⊆ {k_1, ..., k_{i−1}}, the key set K ∪ {k_i} can be placed in the tables and has optimal weight, and let K′ be an optimal solution for {k_1, ..., k_{i−1}}. Consider the connected components of (1, h_1(k_i)) and (2, h_2(k_i)) in G(K). By Lemma 1 and since K ∪ {k_i} can be placed in the tables, at least one of the (possibly identical) connected components must be a tree, without loss of generality the component of (1, h_1(k_i)). Since K ∪ {k_i} is optimal, the connected component of (1, h_1(k_i)) in G({k_1, ..., k_{i−1}}) must also be a tree. (If there was a cycle, a key of higher weight could be substituted for k_i, contradicting the optimality of K ∪ {k_i}.) In particular, the connected component of (1, h_1(k_i)) in G(K′) is a tree. Thus, by Lemma 1 the set K′ ∪ {k_i} can be placed in the tables. □
The lemma implies that the following greedy algorithm finds an optimal key set S′ given keys sorted according to nonincreasing weight.

1. Initialize a union-find data structure for the cells of the hash tables.
2. For each equivalence class, set a saturated flag to false.
3. For i = 1, ..., n:
   (a) Find the equivalence classes c_1 of cell h_1(k_i) in T_1, and c_2 of cell h_2(k_i) in T_2.
   (b) If c_1 or c_2 is not saturated:
       i. Include k_i in the solution.
       ii. Join c_1 and c_2 to form an equivalence class c.
       iii. Set the saturated flag of c if c_1 = c_2, or if the saturated flag is set for c_1 or c_2.

Figure 2: The case where c_1 = c_2 and the component is nonsaturated. The component becomes saturated.

Figure 3: The case with one saturated and one nonsaturated component. The new component becomes saturated.

Figure 4: The case with two saturated components. The new element is not included.

In the loop, equivalence classes correspond to the connected components of the graph G({k_1, ..., k_{i−1}}). There is a simple implementation of a union-find data structure for which operations take O(log* n) amortized time; see [19], which actually gives an even better time bound. Figures 2 to 4 show the three possible cases in step 3b of the algorithm.
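The greedy selection above can be sketched as follows. The representation is our choice: keys are assumed pre-sorted by nonincreasing weight and given as pairs of 0-indexed cell positions (h_1(k), h_2(k)); the path-halving union-find shown is one simple implementation of the kind cited.

```python
class UnionFind:
    """Union-find over hash-table cells, with a per-component
    'saturated' flag (component has as many edges as vertices)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.saturated = [False] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

def greedy_select(half, keys):
    """keys: list of (h1, h2) cell pairs, sorted by nonincreasing weight.
    Returns the indices of a maximum-weight placeable subset, following
    the greedy algorithm in the text."""
    uf = UnionFind(2 * half)       # table-1 cells, then table-2 cells
    chosen = []
    for i, (h1, h2) in enumerate(keys):
        c1, c2 = uf.find(h1), uf.find(half + h2)
        if uf.saturated[c1] and uf.saturated[c2]:
            continue               # both components saturated: skip key
        chosen.append(i)
        if c1 == c2:
            uf.saturated[c1] = True    # new edge closes a cycle
        else:
            uf.parent[c1] = c2         # join the two components
            uf.saturated[c2] = uf.saturated[c1] or uf.saturated[c2]
    return chosen
```

For example, three equally-hashing keys into one cell pair admit only two placements (one per table), so the third, lightest key is skipped.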
What remains is arranging the optimal key set S′ in the tables. Consider a vertex in G(S′) of degree one. It is clear that there must be an arrangement such that the corresponding cell contains the key of the incident edge, so we can make this assignment and delete both the vertex and the edge, repeating as long as degree-one vertices remain. As we remove the same number of edges and vertices from each connected component, the remaining graph consists of connected components with no more edges than vertices and no vertices of degree one, i.e., cycles. The arrangement of edges in a cycle follows as soon as one key has been put (arbitrarily) into one of the tables. The above steps are easily implemented to run in linear time. This establishes item 5 of Theorem 1.
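The arrangement step can be sketched as follows: peel degree-one cells, then walk the remaining cycles. The representation choices (0-indexed cells, edges as (h_1, h_2) pairs, a table index per key as output) are ours, and a real implementation would use flat arrays rather than Python dictionaries.

```python
from collections import defaultdict, deque

def arrange(half, edges):
    """Assign each selected key (an edge (h1, h2)) to table 0 or 1.
    Assumes the edge set is placeable in the sense of Lemma 1.
    Vertices 0..half-1 are table-0 cells, half..2*half-1 table-1 cells."""
    incident = defaultdict(set)           # vertex -> ids of unassigned edges
    for e, (h1, h2) in enumerate(edges):
        incident[h1].add(e)
        incident[half + h2].add(e)
    assign = [None] * len(edges)

    def place(e, v):
        """Give edge e to cell v; remove e from the graph."""
        assign[e] = 0 if v < half else 1
        for u in (edges[e][0], half + edges[e][1]):
            incident[u].discard(e)

    # Phase 1: repeatedly give a degree-one cell its incident key.
    queue = deque(v for v in list(incident) if len(incident[v]) == 1)
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue
        e = next(iter(incident[v]))
        place(e, v)
        other = edges[e][0] if v >= half else half + edges[e][1]
        if len(incident[other]) == 1:
            queue.append(other)

    # Phase 2: what remains are cycles; place one key arbitrarily and
    # let the rest of the cycle follow.
    for v in list(incident):
        while incident[v]:
            e = next(iter(incident[v]))
            place(e, v)
            v = edges[e][0] if v >= half else half + edges[e][1]
    return assign
```

Both phases touch each edge a constant number of times, mirroring the linear-time claim in the text.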
2.4 Quality of Solution
We now turn to the problem of estimating the quality of the solution. Note that the optimal key set returned by our algorithm does not depend on the actual weights, but only on the sequence of hash function values. Thus, the expected weight of our optimal solution is Σ_{i=1}^n p_{r,i} v_i, where p_{r,i} is the probability that the i-th key is included in the returned optimal set of keys, which is independent of the weights.

Our algorithm includes all keys {k_1, ..., k_i} in the optimal solution returned if they can all be accommodated under the given hash functions. Using the result of [14] mentioned in Section 2.1 on {k_1, ..., k_i} with δ = r/i − 2, we have that for i < r/2 this happens with probability at least 1 − (5/2) r^{-1}/(r/i − 2). In particular, p_{r,i} is at least this big.

For i ≥ r/2 we derive a lower bound on p_{r,i} as follows. If one of the vertices (1, h_1(k_i)) and (2, h_2(k_i)) in G({k_1, ..., k_{i−1}}) is isolated, then k_i is in the optimal solution returned. The randomness assumption on our hash functions implies that G({k_1, ..., k_{i−1}}) has i − 1 randomly and independently chosen edges. Thus, we have the bound p_{r,i} ≥ 1 − (1 − (1 − 2/r)^{i−1})² = 2(1 − 2/r)^{i−1} − (1 − 2/r)^{2(i−1)} ≈ 2e^{−2i/r} − e^{−4i/r}. This establishes item 3 of Theorem 1 and concludes the proof.

2.5 A Lower Bound
This section gives a lower bound on the amount of memory needed by a lossy dictionary with an ε fraction of false positives and γn false negatives. Our proof technique is similar to that used for the lower bound in [4] for the case γ = 0.

Proposition 1. For 0 < ε < 1/2 and 0 < γ < 1, a lossy dictionary representing a set S ⊆ {0,1}^w of n keys, 120 < n ≤ 2^{w−1}, with at most ε2^w false positives and at most γn false negatives must use space of at least

(1 − γ) n log(1/(ε + n/2^w)) − (5/2) n

bits.

Proof. We can assume without loss of generality that γn is an integer (this only gives a stronger space lower bound), and that ε2^w is an integer. Consider the set of all data structures used for the various subsets of n elements from {0,1}^w. Any of these data structures must represent a set of at most ε2^w + n keys, in order to meet the requirement on the number of false positives. Thus, the number of n-element sets having up to γn keys outside the set represented by a given data structure is at most Σ_{i=0}^{γn} C(ε2^w + n, n − i) C(ε2^w, i). Since ε < 1/2 and n ≤ 2^{w−1} we have ε2^w + n ≤ 2^w, and so the largest term in the summation is C(ε2^w + n, n − γn) C(ε2^w, γn). Thus we have the upper bound n C(ε2^w + n, n − γn) C(ε2^w, γn).

We will use the inequalities (a/b)^b ≤ C(a, b) < (ae/b)^b, see e.g. [10, Proposition 1.3]. By the upper bound on the number of sets representable by each data structure, we need, in order to represent all C(2^w, n) key sets, space at least

log C(2^w, n) − log( n C(ε2^w + n, (1−γ)n) C(ε2^w, γn) )
≥ log (2^w/n)^n − log( n ((ε2^w + n)e/((1−γ)n))^{(1−γ)n} (ε2^w e/(γn))^{γn} )
= n log( (2^w/n) · (1−γ)n/((ε2^w + n)e) ) − γn log( ((1−γ)n/((ε2^w + n)e)) · (ε2^w e/(γn)) ) − log n
= n log( ((1−γ)/e)/(ε + n/2^w) ) − γn log( (1−γ)ε/(γ(ε + n/2^w)) ) − log n
≥ (1 − γ) n log( 1/(ε + n/2^w) ) − (H(γ) + log e) n − log n,

where H(γ) = −γ log γ − (1−γ) log(1−γ) ≤ 1 is the binary entropy function. For n > 120 the sum of the last two terms is bigger than −(5/2)n. □
In the discussion following Theorem 1 we noted that if there are quotient functions with optimal range, the space usage of our scheme is n log(1/(ε + n/2^w)) + O(n) when tables of combined size n are used. The expected fraction γ of false negatives is less than 3/10 by Theorem 1. This means that our data structure uses within O(n) bits of 10/7 times the lower bound. The experiments described in Section 3 indicate that the true factor is less than 6/5.

2.6 Using More Tables
We now briefly look at a generalization of the two-table scheme to schemes with more tables. Unfortunately the algorithm described in Section 2.3 does not seem to generalize to more than two tables. An optimal solution can again be found using maximum weight matching, but the time complexity of this solution is not attractive. Instead we can use a variant of the cuckoo scheme described by the authors in [15], greedily attempting to accommodate keys in order k_1, ..., k_n. For two tables an insertion attempt for k_i works as follows: We store k_i in cell h_1(k_i) of T_1, pushing the previous occupant, if any, away and thus making it nestless. If cell h_1(k_i) was free we are done. Otherwise we insert the nestless element in T_2, possibly pushing out another element. This continues until we either find a free cell or loop around unable to find a free cell, in which case k_i is discarded. It follows from [15] and the analysis in Section 2.3 that this algorithm finds an optimal solution, though not as efficiently as the algorithm given in Section 2.3. When using three or more tables it is not obvious in which of the tables one should attempt placing the nestless key. One heuristic that works well is to simply pick one of the two possible tables at random. It is [...] which will provably cross any large subset of the vertices with high probability.

The main drawback of using three tables is, of course, that another memory probe is needed for lookups. Furthermore, as the range of the hash functions must be smaller than when using two tables, the smallest possible range of quotient functions is larger, so more space may be needed for each cell.
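For two tables, the cuckoo-style insertion just described can be sketched as follows. Two choices are ours, not the paper's: keys are represented as (h_1, h_2) cell pairs, and on detecting a loop we discard whichever key is nestless at that point, a simplification of the text's rule of discarding k_i. The iteration budget used to detect a loop is a simple bound adequate for this sketch.

```python
def cuckoo_select(half, keys):
    """Greedy cuckoo-style accommodation (after [15]): keys are (h1, h2)
    cell pairs in nonincreasing weight order; each insertion kicks out
    previous occupants until a free cell is found or a loop is detected.
    Returns the sorted indices of the keys that end up stored."""
    t1, t2 = [None] * half, [None] * half
    for i, _ in enumerate(keys):
        nestless, in_t1 = i, True
        for _ in range(2 * (i + 1)):      # budget: enough steps to loop around
            if in_t1:
                h = keys[nestless][0]
                t1[h], nestless = nestless, t1[h]   # evict occupant of T1 cell
            else:
                h = keys[nestless][1]
                t2[h], nestless = nestless, t2[h]   # evict occupant of T2 cell
            if nestless is None:
                break                     # found a free cell
            in_t1 = not in_t1             # evicted key tries its other table
        # If nestless is not None here, that key is discarded.
    return sorted(x for x in t1 + t2 if x is not None)
```

On the example of three keys sharing one cell pair, the two heaviest are kept and the third is discarded, matching the optimal solution of the two-table algorithm.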
3 Experiments
An important performance parameter of our lossy dictionaries is the ability to store many keys with high weight. We tested this ability for lossy dictionaries using two and three tables. For comparison, we also tested the simple one-table scheme that stores in each cell the key of greatest weight hashing to it. The tests were done using truly random hash function values, obtained from a high quality collection of random bits freely available on the Internet [12]. Figure 5 shows experimentally determined values of p_{r,αr}, the probability that the key with index i = αr is stored in the dictionary, determined from 10^4 trials. For the experiments with one and two tables we used table size r = 2048, while for the experiment with three tables we used r = 1536. We also tried various other table sizes, but the graphs were almost indistinguishable from the ones shown. From Figure 5 we see the significant improvement of moving from one to more tables. As predicted, nearly all of the r/2 heaviest keys are stored when using two tables. For three tables this number increases to about 0.88r. Of the r heaviest keys, about 84% are stored when using two tables, and 95% are stored when using three tables.

Apart from asymptotically vanishing differences around the point where the curves start falling from 1, the graphs of Figure 5 seem independent of r. For two tables the observed value of p_{r,αr} for α > 1/2 is approximately 3.5/9.6^α, and for three tables it is approximately 8/33^α for α > 0.95.

The gap to the two-table lower bound of Theorem 1 can be explained by the fact that this lower bound considers only two cells of the hash tables, whereas opportunities for storing keys may appear when considering more cells.
4 Conclusion
We have introduced the concept of lossy dictionaries and presented a simple and efficient data structure implementing a lossy dictionary. Our data structure combines very efficient lookups and near-optimal space utilization, and thus seems a promising alternative to previously known data structures when a small percentage of false negatives is tolerable, such as in the examples in Section 1.2.

Though simple and efficient hash functions seem to work well in practice with our data structure, the challenge of finding such families that provably work well remains. Furthermore, the last two graphs in Figure 5 are not completely understood. The same is true for the insertion heuristic for three or more tables.

Acknowledgment. We thank Stephen Alstrup and Theis Rauhe for helpful discussions on the construction of our two table data structure.
Figure 5: The observed probability that the element with (αr)-th highest weight is stored when using one, two and three tables. For two tables our lower bound is shown. (Three panels, titled One Table, Two Tables and Three Tables, each plotting probability from 0 to 1 against α from 0 to 3.)
References
[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970.

[2] Andrej Brodnik and J. Ian Munro. Membership in constant time and almost-minimum space. SIAM J. Comput., 28(5):1627–1640 (electronic), 1999.

[3] Harry Buhrman, Peter Bro Miltersen, Jaikumar Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC '00), pages 449–458. ACM Press, New York, 2000.

[4] Larry Carter, Robert Floyd, John Gill, George Markowsky, and Mark Wegman. Exact and approximate membership testers. In Proceedings of the 10th Annual ACM Symposium on Theory of Computing (STOC '78), pages 59–65. ACM Press, New York, 1978.

[5] William J. Cook, William H. Cunningham, William R. Pulleyblank, and Alexander Schrijver. Combinatorial Optimization. John Wiley & Sons Inc., New York, 1998. A Wiley-Interscience Publication.

[6] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 25(1):19–51, 1997. doi:10.1006/jagm.1997.0873.

[7] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary cache: A scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293, 2000.

[8] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. Assoc. Comput. Mach., 31(3):538–544, 1984.

[9] Insung Ihm and Sanghun Park. Wavelet-based 3D compression scheme for very large volume data. In Graphics Interface, pages 107–116, 1998.

[10] Stasys Jukna. Extremal Combinatorics with Applications in Computer Science. Springer-Verlag, 2000.

[11] Tae-Young Kim and Yeong Gil Shin. An efficient wavelet-based compression method for volume rendering. In Seventh Pacific Conference on Computer Graphics and Applications, pages 147–156, 1999.

[12] George Marsaglia. The Marsaglia random number CDROM including the diehard battery of tests of randomness. http://stat.fsu.edu/pub/diehard/.

[13] Rasmus Pagh. Low redundancy in static dictionaries with O(1) lookup time. In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP '99), volume 1644 of Lecture Notes in Computer Science, pages 595–604. Springer-Verlag, Berlin, 1999.

[14] Rasmus Pagh. On the cell probe complexity of membership and perfect hashing. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing (STOC '01). ACM Press, New York, 2001.

[15] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. To appear in Proceedings of ESA 2001, 2001.

[16] Flemming Friche Rodler. Wavelet based 3D compression with fast random access for very large volume data. In Seventh Pacific Conference on Computer Graphics and Applications, pages 108–117, Seoul, Korea, 1999.

[17] Flemming Friche Rodler and Rasmus Pagh. Fast random access to wavelet compressed volumetric data using hashing. Manuscript.

[18] Dennis S. Severance and Guy M. Lohman. Differential files: Their application to the maintenance of large data bases. ACM Transactions on Database Systems, 1(3):256–267, September 1976.

[19] Robert Endre Tarjan. Efficiency of a good but not linear set union algorithm. J. Assoc. Comput. Mach., 22:215–225, 1975.