BRICS
Basic Research in Computer Science
Lossy Dictionaries
Rasmus Pagh
Flemming Friche Rodler
BRICS Report Series RS-01-33
ISSN 0909-0878 August 2001
Copyright © 2001, Rasmus Pagh & Flemming Friche Rodler.
BRICS, Department of Computer Science, University of Aarhus. All rights reserved.
Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.
See back inner page for a list of recent BRICS Report Series publications.
Copies may be obtained by contacting:
BRICS
Department of Computer Science, University of Aarhus
Ny Munkegade, building 540, DK-8000 Aarhus C
Denmark
Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk
BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:
http://www.brics.dk ftp://ftp.brics.dk
This document in subdirectory RS/01/33/
Rasmus Pagh∗ and Flemming Friche Rodler†
BRICS
Department of Computer Science
University of Aarhus, Denmark
{pagh,ffr}@brics.dk
August 2001
Abstract
Bloom filtering is an important technique for space efficient storage of a conservative approximation of a set S. The set stored may have up to some specified number of false positive members, but all elements of S are included. In this paper we consider lossy dictionaries that are also allowed to have false negatives, i.e., leave out elements of S. The aim is to maximize the weight of included keys within a given space constraint. This relaxation allows a very fast and simple data structure making almost optimal use of memory. Being more time efficient than Bloom filters, we believe our data structure to be well suited for replacing Bloom filters in some applications. Also, the fact that our data structure supports information associated with keys paves the way for new uses, as illustrated by an application in lossy image compression.
1 Introduction
Dictionaries are part of many algorithms and data structures. A dictionary provides access to information indexed by a set S of keys: Given a key, it returns the associated information or reports that the key is not in the set. In this paper we will not be concerned with updates, i.e., we consider the static dictionary problem. The main parameters of interest are of course the space used by the dictionary and the time for looking up information. We will assume keys as well as the information associated with keys to have a fixed size.

A large literature has grown around the problem of constructing efficient dictionaries, and theoretically satisfying solutions have been found. Often a slightly easier problem has been considered, namely the membership problem, which is the dictionary problem without associated information. It is usually easy to derive a dictionary from a solution to the membership problem, using [...]
∗ Partially supported by the IST Programme of the EU under contract number IST-1999-14186 (ALCOM-FT).
† Basic Research in Computer Science (www.brics.dk), funded by the Danish National Research Foundation.
We are particularly interested in dictionary and membership schemes using little memory. Let n denote the size of the key set S. It has been shown that when keys are w-bit machine words, lookups can be performed in constant time in a membership data structure occupying B + o(B) bits of memory, where B = ⌈log C(2^w, n)⌉ is the minimum amount of memory needed to be able to represent any subset of size n [2] (logarithms in this paper are base 2, and C(a, b) denotes the binomial coefficient). However, constant factors in the lower order term and lookup time make this and similar schemes less than one could hope for from an applied point of view. Also, difficulty of implementation is an obstacle to practical use. In total, current schemes with asymptotically optimal space usage appear to be mainly of theoretical interest.
If one relaxes the requirements to the membership data structure, allowing it to store a slightly different key set than intended, new possibilities arise. A technique finding many applications in practice is Bloom filtering [1]. This technique allows space-efficient storage of a superset S′ of the key set S, such that S′ \ S is no more than an ε fraction of {0,1}^w. For n ≤ ε2^w, about log(1/ε) bits per key in S are necessary and sufficient for this [4]. This is a significant savings compared to a membership data structure using B ≈ n log(2^w e/n) bits. Lookup of a key using Bloom filtering requires O(log(1/ε)) memory accesses, and is thus relatively slow compared to other hashing schemes when ε is small. Also, Bloom filtering differs from most other hashing techniques in that it does not yield a solution to the dictionary problem.
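As background, the bit-vector scheme just described can be sketched as follows. This is our generic illustration, not code from the paper: the table size, the number of hash functions, and the SHA-256-based probe derivation are all illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: stores a superset S' of the inserted keys.
    Lookups may err with false positives, never with false negatives."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _probes(self, key):
        # Derive k bit positions from a hash of (probe index, key).
        for i in range(self.k):
            h = hashlib.sha256(b"%d:%s" % (i, key)).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        # Lookup is the conjunction of k bit probes, as in Bloom's scheme.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._probes(key))

f = BloomFilter(m_bits=4096, k_hashes=7)
for key in [b"alpha", b"beta", b"gamma"]:
    f.add(key)
```

With k hash functions a lookup costs k bit probes, which is the O(log(1/ε)) access cost mentioned above when k ≈ log(1/ε).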
1.1 This Paper
In this paper we introduce the concept of lossy dictionaries that can have not only false positives (like Bloom filters), but also false negatives. That is, some keys in S (with associated information) are thrown away when constructing the dictionary. For false positives there is no guarantee on the associated information returned. We let each key in S have a weight, and try to maximize the sum of weights of keys in the dictionary under a given space constraint.

We study this problem on a unit cost RAM, in the case where keys are machine words of w bits, examining a very simple and efficient data structure from a theoretical as well as an experimental point of view. Experimentally, we find that our data structure has surprisingly good behavior with respect to keeping the keys of largest weight. The experimental results are partially explained by our theoretical considerations, under strong assumptions on the hash functions involved. Specifically, we assume that in our RAM model, for a number of random functions, arbitrary function values can be returned in constant time by an oracle. We also show that our data structure is nearly optimal with respect to space usage.
1.2 Applications
A cache can be seen as a dictionary that stores a small subset of a large key set, plus associated information. It is thus inherently lossy. A lossy dictionary allowed to discard a small fraction of a key set may thus in many cases be a quite [...] allow no false positives. Our lossy dictionary seems best suited for applications where the cache only changes periodically, as for example in Web caching.

Web cache sharing [7] is a technique for implementing cooperating caches, for example Web proxies. When a request arrives at a proxy, it first checks whether it can answer the request. If not, it can forward the request to other proxies in the network. However, this increases traffic and is rather expensive. In cooperative caching each proxy keeps a summary of the content of all relevant proxies available to it. To reduce space requirements, this summary is stored with a small fraction of error using Bloom filtering. Often this reduces network traffic dramatically, since there is no more than a small chance that an expensive request forwarding is performed in vain. Lossy dictionaries with two-sided error could be used as a summary rather than a Bloom filter, since a small fraction of false negatives (cache misses) is tolerable.

In fact, the general idea of using in-memory summaries to reduce the number of expensive operations, such as I/Os, is well known in the database community. It dates at least back to [18], which uses Bloom filtering for efficient management of different versions of databases.

Recently, interest in lossy (volume) data compression with fast random access to decoded data has arisen [9, 11, 16, 17]. In [17] we show that lossy dictionaries are well suited for this purpose, providing lossy storage of the coefficients of wavelet transformed data. Compared to the previously best methods in the literature [9, 16], our lossy dictionary based scheme typically performs 80% better in terms of compression ratio, while significantly reducing the random access time.
1.3 Related Work
Most previous work on static dictionaries has considered the membership problem on a unit cost RAM with word size w. The first membership data structure with worst case constant lookup time using O(n) words of space was constructed by Fredman et al. [8]. For constant δ > 0, the space usage is O(B) when 2^w > n^{1+δ}, but in general the data structure may use Ω(Bw) bits of space. The space usage has been lowered to B + o(B) bits by Brodnik and Munro [2]. The lower order term was subsequently improved to o(n) + O(log w) bits by the first author [13]. The main concept used in the latter paper is that of a quotient function q of a hash function h, defined to be a function such that the mapping k ↦ (h(k), q(k)) is injective.

The membership problem with false positives was first considered by Bloom [1]. He described a technique, now known as Bloom filtering, where lookups return the conjunction of a number of bits from a bit vector. The locations of the bit probes are the values of a series of hash functions on the element to be looked up. Apart from Bloom filtering, the paper presents a less space efficient data structure that is readily turned into a lossy dictionary with only false positives. However, the space usage of the derived lossy dictionary is not optimal. Carter et al. [4] provided a lower bound of n log(1/ε) bits on the space needed to solve membership with an ε fraction of false positives, for n ≤ ε2^w, and gave data structures essentially matching this bound. Though none of their membership data structures have constant lookup time, such a data structure follows by plugging the abovementioned results on space optimal membership data structures [2, 13] into a general reduction provided in [4]. In fact, the dictionary of [13] can be easily modified to a lossy dictionary with false positives, thus also supporting associated information, using O(n + log w) bits more than the lower bound.

Another relaxation of the membership problem was recently considered by Buhrman et al. [3]. They store the set S exactly, but allow the lookup procedure to use randomization and to have some probability of error. For two-sided error ε they show that there exists a data structure of O(nw/ε²) bits in which lookups can be done using just one bit probe. It was shown that to do the same without false negatives, O(n²w/ε²) bits suffice, and that this is essentially optimal. Schemes using more bit probes and less space were also investigated. If one fixes the random bits of the lookup procedure appropriately, the result is a lossy dictionary with error ε. However, it is not clear how to efficiently guarantee the ε fraction of false positives in a reasonable model of computation, so this does not immediately give rise to a lossy dictionary.
2 Lossy Dictionaries
Consider a set S containing keys k_1, ..., k_n with associated information a_1, ..., a_n and positive weights v_1, ..., v_n. Suppose we are given an upper bound m on available space and an error parameter ε ≥ 0. The lossy dictionary problem for ε = 0 is to store a subset of the keys in S and corresponding associated information in a data structure of m bits, trying to optimize the sum of weights of included keys. For general ε we also allow the dictionary to contain up to ε2^w keys from the complement of S. In this section we show the following theorem.

Theorem 1. Let a sequence of keys k_1, ..., k_n ∈ {0,1}^w, associated information a_1, ..., a_n ∈ {0,1}^l, and weights v_1 ≥ ... ≥ v_n > 0 be given. Let r > 0 be an even integer, and b ≥ 0 an integer. Suppose we have oracle access to random functions h_1, h_2: {0,1}^w → {1, ..., r/2} and corresponding quotient functions q_1, q_2: {0,1}^w → {0,1}^s \ {0^s}. There is a lossy dictionary with the following properties:

1. The space usage is r(s − b + l) bits (two tables with r/2 cells of s − b + l bits).

2. The fraction ε of false positives is bounded by ε ≤ (2^b − 1) r/2^w.

3. The expected weight of the keys in the set stored is Σ_{i=1}^n p_{r,i} v_i, where

   p_{r,i} ≥ 1 − (5/2) r^{-1}/(r/i − 2),   for i < r/2,
   p_{r,i} ≥ 2(1 − 2/r)^{i−1} − (1 − 2/r)^{2(i−1)},   for i ≥ r/2,

   is the probability that k_i is included in the set (which is independent of v_i).

5. The construction time is O(n log* n + rl/w).
As discussed in Section 2.1 there exist quotient functions for s = w − log(r/2) + O(1) if the hash functions map approximately the same number of elements to each value in {1, ..., r/2}. The inequality in item 2 is satisfied for b = ⌊log(ε2^w/r + 1)⌋, so for s = w − log r + O(1) an ε fraction of false positives can be achieved using space r(log(1/(ε + r/2^w)) + l + O(1)). As can be seen from item 3, almost all of the keys {k_1, ..., k_{r/2}} are expected to be included in the set represented by the lossy dictionary. For i ≥ r/2 our bound on p_{r,i} is shown in Figure 5 of Section 3, together with experimentally observed probabilities. If n ≥ r and if r is large enough, it can be shown by integration that, in the expected sense, more than 70% of the keys from {k_1, ..., k_r} are included in the set (our experiments indicate 84%). We show in Section 2.5 that the amount of space that we use to achieve this is within a small constant factor of optimal.

Note that by setting b = 0 we obtain a lossy dictionary with no false positives. Another point is that given a desired maximum space usage m and false positive fraction ε, the largest possible size r of the tables can usually be chosen efficiently. Assume, for example, that we have quotient functions with range of ⌈log(2^{w+1}/r)⌉ bits and consider the case b = 0. The memory usage is r(⌈log(2^{w+1}/r)⌉ + l). Whenever r is doubled, the number of bits per cell becomes one less. This means that the memory usage increases piecewise linearly in r, with jumps when r is a power of two. By setting r = 2^i for i = 1, 2, 3, ... we find the i for which m is first exceeded. The correct value of r can now easily be found in the interval 2^{i−1} < r < 2^i. For general b this becomes more complicated, as we need to investigate more intervals, but finding r is still implementable in O(log m) time.

2.1 Preliminaries
The starting point for the design of our data structure is a static dictionary recently described in [14]. In this dictionary, two hash tables T_1 and T_2 are used together with two hash functions h_1, h_2: {0,1}^w → {1, ..., r/2}, where r denotes the combined size of the hash tables, assumed to be even. A key x ∈ S is stored in either cell h_1(x) of T_1 or cell h_2(x) of T_2. It was shown that if r ≥ (2 + δ)n, for constant δ > 0, and h_1, h_2 are random functions, there exists a way of arranging the keys in the tables according to the hash functions with probability at least 1 − 5/(2δr). For small δ this gives a dictionary utilizing about 50% of the hash table cells. The arrangement of keys was shown to be computable in expected linear time.

Another central concept is that of quotient functions. Recall that a quotient function q of a hash function h is a function such that the mapping k ↦ (h(k), q(k)) is injective [13]. When storing a key k in cell h(k) of a hash table, it is sufficient to store q(k) to uniquely identify k among all other elements hashing to h(k). To mark empty cells, one needs a bit string not mapped to by the quotient function, e.g. 0^s for the quotient functions of Theorem 1.

Figure 1: Example of our data structure. (T_1 holds q_1(k_1), q_1(k_2), q_1(k_4) in cells h_1(k_1), h_1(k_2), h_1(k_4); T_2 holds q_2(k_3), q_2(k_5) in cells h_2(k_3), h_2(k_5).)

The idea of using quotient functions is that storing q(k) may require fewer bits than storing k itself. If a fraction O(1/r) of all possible keys hashes to each of r hash table cells, there is a quotient function whose function values can be stored in w − log r + O(1) bits. This approach was used in [13] to construct a dictionary using space close to the information theoretical minimum.
Example. We consider the hash function family from [6] mapping from {0,1}^w to {0,1}^t, i.e., with r = 2^t. It contains functions of the form h_a(k) = (ak mod 2^w) div 2^{w−t} for a odd and 0 < a < 2^w. A corresponding family of quotient functions is given by q_a(k) = (ak mod 2^w) mod 2^{w−t}, whose function values can be stored in w − log r bits.

2.2 Our Data Structure
The idea behind our lossy dictionary, compared to the static dictionary of [14] described above, is to try to fill the hash tables almost completely, working with key sets of size similar to or larger than r. Each key has two hash table cells to which it can be matched.

Thus, given a pair of hash functions, the problem of finding a maximum weight subset of S that can be arranged into the hash tables is a maximum weight matching problem that can be solved in polynomial time, see e.g. [5]. In Section 2.3 we will present an algorithm that finds such an optimal solution in time O(n log* n). The term O(rl/w) in the time bound of Theorem 1 is the time needed to copy associated information to the tables. Assume for now that we know which keys are to be represented in which hash table cells.

For b = 0 we simply store quotient function values in nonempty hash table cells and the value 0^s in empty hash table cells, using s bits per cell, as shown in Figure 1. For general b we store only the first s − b bits. Observe that no more than 2^b keys with the same hash function value can share the first s − b bits of the quotient function value. This means that there are at most 2^b − 1 false positives for each nonempty cell. Since 0^s is not in the range, this is also true for empty cells. In addition to the s − b bits, we use l bits per cell to store associated information.
We now proceed to fill in the remaining details on items 3 and 5 of Theorem 1.

2.3 Construction Algorithm

Recall that the task of constructing our data structure boils down to finding the largest weight arrangement of keys in the tables. Given hash functions h_1 and h_2 and a key set K, we define the bipartite graph G(K) with vertex set {1,2} × {1, ..., r/2}, corresponding in a natural way to hash table cells, and the multiset of edges {{(1, h_1(k)), (2, h_2(k))} | k ∈ K}, corresponding to keys. Note that there may be parallel edges if several keys have the same pair of hash function values. We will use the terms keys/edges and cells/vertices synonymously. A connected component of G(K) is defined to be saturated if the number of edges is greater than or equal to the number of vertices, i.e., if it is not a tree. We have the following characterization of the key sets that can be placed in the tables according to the given hash functions.
Lemma 1. The key set K can be placed in the tables if and only if each connected component of G(K) is a tree, plus possibly an extra edge.

Proof. By Hall's theorem, K can be placed in the tables if and only if every subset K′ ⊆ K satisfies |h_1(K′)| + |h_2(K′)| ≥ |K′|. This is true if and only if every subset K′ of edges in G(K) covers at least |K′| vertices. Since it is equivalent to quantify only over subsets of edges within a connected component, the lemma follows. □
By an optimal solution for a key set K we will understand a maximum weight subset of K that can be placed in the tables.

Lemma 2. There is an optimal solution for {k_1, ..., k_i} including key k_i if and only if for any optimal solution K′ for {k_1, ..., k_{i−1}}, the set K′ ∪ {k_i} can be placed in the tables.

Proof. If K′ ∪ {k_i} can be placed in the tables for some solution K′ optimal for {k_1, ..., k_{i−1}}, then K′ ∪ {k_i} must be optimal for {k_1, ..., k_i}.

On the other hand, suppose that for some K ⊆ {k_1, ..., k_{i−1}}, the key set K ∪ {k_i} can be placed in the tables and has optimal weight, and let K′ be an optimal solution for {k_1, ..., k_{i−1}}. Consider the connected components of (1, h_1(k_i)) and (2, h_2(k_i)) in G(K). By Lemma 1 and since K ∪ {k_i} can be placed in the tables, at least one of the (possibly identical) connected components must be a tree, without loss of generality the component of (1, h_1(k_i)). Since K ∪ {k_i} is optimal, the connected component of (1, h_1(k_i)) in G({k_1, ..., k_{i−1}}) must also be a tree. (If there was a cycle, a key of higher weight could be substituted for k_i, contradicting the optimality of K ∪ {k_i}.) In particular, the connected component of (1, h_1(k_i)) in G(K′) is a tree. Thus, by Lemma 1 the set K′ ∪ {k_i} can be placed in the tables. □
The lemma implies that the following greedy algorithm finds an optimal key set S′ given keys sorted according to nonincreasing weight.

1. Initialize a union-find data structure for the cells of the hash tables.
2. For each equivalence class, set a saturated flag to false.
3. For i = 1, ..., n:
   (a) Find the equivalence classes c_1 of cell h_1(k_i) in T_1, and c_2 of cell h_2(k_i) in T_2.
   (b) If c_1 or c_2 is not saturated:
       i. Include k_i in the solution.
       ii. Join c_1 and c_2 to form an equivalence class c.
       iii. Set the saturated flag of c if c_1 = c_2, or if the saturated flag is set for c_1 or c_2.

Figure 2: The case where c_1 = c_2 and the component is nonsaturated. The component becomes saturated.

Figure 3: The case with one saturated and one nonsaturated component. The new component becomes saturated.

Figure 4: The case with two saturated components. The new element is not included.

In the loop, equivalence classes correspond to the connected components of the graph G({k_1, ..., k_{i−1}}). There is a simple implementation of a union-find data structure for which operations take O(log* n) amortized time; see [19], which actually gives an even better time bound. Figures 2 to 4 show the three possible cases in step 3b of the algorithm.
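The greedy selection above can be sketched as follows. The representation is our choice: keys are assumed pre-sorted by nonincreasing weight and given as pairs of 0-indexed cell positions (h_1(k), h_2(k)); the path-halving union-find shown is one simple implementation of the kind cited.

```python
class UnionFind:
    """Union-find over hash-table cells, with a per-component
    'saturated' flag (component has as many edges as vertices)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.saturated = [False] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

def greedy_select(half, keys):
    """keys: list of (h1, h2) cell pairs, sorted by nonincreasing weight.
    Returns the indices of a maximum-weight placeable subset, following
    the greedy algorithm in the text."""
    uf = UnionFind(2 * half)       # table-1 cells, then table-2 cells
    chosen = []
    for i, (h1, h2) in enumerate(keys):
        c1, c2 = uf.find(h1), uf.find(half + h2)
        if uf.saturated[c1] and uf.saturated[c2]:
            continue               # both components saturated: skip key
        chosen.append(i)
        if c1 == c2:
            uf.saturated[c1] = True    # new edge closes a cycle
        else:
            uf.parent[c1] = c2         # join the two components
            uf.saturated[c2] = uf.saturated[c1] or uf.saturated[c2]
    return chosen
```

For example, three equally-hashing keys into one cell pair admit only two placements (one per table), so the third, lightest key is skipped.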
What remains is arranging the optimal key set S′ in the tables. Consider a vertex in G(S′) of degree one. It is clear that there must be an arrangement such that the corresponding cell contains the key of the incident edge, so we can make this assignment and delete both the vertex and the edge, repeating as long as degree-one vertices remain. As we remove the same number of edges and vertices from each connected component, the remaining graph consists of connected components with no more edges than vertices and no vertices of degree one, i.e., cycles. The arrangement of edges in a cycle follows as soon as one key has been put (arbitrarily) into one of the tables. The above steps are easily implemented to run in linear time. This establishes item 5 of Theorem 1.
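The arrangement step can be sketched as follows: peel degree-one cells, then walk the remaining cycles. The representation choices (0-indexed cells, edges as (h_1, h_2) pairs, a table index per key as output) are ours, and a real implementation would use flat arrays rather than Python dictionaries.

```python
from collections import defaultdict, deque

def arrange(half, edges):
    """Assign each selected key (an edge (h1, h2)) to table 0 or 1.
    Assumes the edge set is placeable in the sense of Lemma 1.
    Vertices 0..half-1 are table-0 cells, half..2*half-1 table-1 cells."""
    incident = defaultdict(set)           # vertex -> ids of unassigned edges
    for e, (h1, h2) in enumerate(edges):
        incident[h1].add(e)
        incident[half + h2].add(e)
    assign = [None] * len(edges)

    def place(e, v):
        """Give edge e to cell v; remove e from the graph."""
        assign[e] = 0 if v < half else 1
        for u in (edges[e][0], half + edges[e][1]):
            incident[u].discard(e)

    # Phase 1: repeatedly give a degree-one cell its incident key.
    queue = deque(v for v in list(incident) if len(incident[v]) == 1)
    while queue:
        v = queue.popleft()
        if len(incident[v]) != 1:
            continue
        e = next(iter(incident[v]))
        place(e, v)
        other = edges[e][0] if v >= half else half + edges[e][1]
        if len(incident[other]) == 1:
            queue.append(other)

    # Phase 2: what remains are cycles; place one key arbitrarily and
    # let the rest of the cycle follow.
    for v in list(incident):
        while incident[v]:
            e = next(iter(incident[v]))
            place(e, v)
            v = edges[e][0] if v >= half else half + edges[e][1]
    return assign
```

Both phases touch each edge a constant number of times, mirroring the linear-time claim in the text.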
2.4 Quality of Solution
We now turn to the problem of estimating the quality of the solution. Note that the optimal key set returned by our algorithm does not depend on the actual weights, but only on the sequence of hash function values. Thus, the expected weight of our optimal solution is Σ_{i=1}^n p_{r,i} v_i, where p_{r,i} is the probability that the i-th key is included in the returned optimal set of keys, which is independent of the weights.

Our algorithm includes all keys {k_1, ..., k_i} in the optimal solution returned if they can all be accommodated under the given hash functions. Using the result of [14] mentioned in Section 2.1 on {k_1, ..., k_i} with δ = r/i − 2, we have that for i < r/2 this happens with probability at least 1 − (5/2) r^{-1}/(r/i − 2). In particular, p_{r,i} is at least this big.

For i ≥ r/2 we derive a lower bound on p_{r,i} as follows. If one of the vertices (1, h_1(k_i)) and (2, h_2(k_i)) in G({k_1, ..., k_{i−1}}) is isolated, then k_i is in the optimal solution returned. The randomness assumption on our hash functions implies that G({k_1, ..., k_{i−1}}) has i − 1 randomly and independently chosen edges. Thus, we have the bound p_{r,i} ≥ 1 − (1 − (1 − 2/r)^{i−1})² = 2(1 − 2/r)^{i−1} − (1 − 2/r)^{2(i−1)} ≈ 2e^{−2i/r} − e^{−4i/r}. This establishes item 3 of Theorem 1 and concludes the proof.

2.5 A Lower Bound
This section gives a lower bound on the amount of memory needed by a lossy dictionary with an ε fraction of false positives and γn false negatives. Our proof technique is similar to that used for the lower bound in [4] for the case γ = 0.

Proposition 1. For 0 < ε < 1/2 and 0 < γ < 1, a lossy dictionary representing a set S ⊆ {0,1}^w of n keys, 120 < n ≤ 2^{w−1}, with at most ε2^w false positives and at most γn false negatives must use space of at least

(1 − γ) n log(1/(ε + n/2^w)) − (5/2) n

bits.

Proof. We can assume without loss of generality that γn is an integer (this only gives a stronger space lower bound), and that ε2^w is an integer. Consider the set of all data structures used for the various subsets of n elements from {0,1}^w. Any of these data structures must represent a set of at most ε2^w + n keys, in order to meet the requirement on the number of false positives. Thus, the number of n-element sets having up to γn keys outside the set represented by a given data structure is at most Σ_{i=0}^{γn} C(ε2^w + n, n − i) C(ε2^w, i). Since ε < 1/2 and n ≤ 2^{w−1} we have ε2^w + n ≤ 2^w, and so the largest term in the summation is C(ε2^w + n, n − γn) C(ε2^w, γn). Thus we have the upper bound n C(ε2^w + n, n − γn) C(ε2^w, γn).

We will use the inequalities (a/b)^b ≤ C(a, b) < (ae/b)^b, see e.g. [10, Proposition 1.3]. By the upper bound on the number of sets representable by each data structure, we need, in order to represent all C(2^w, n) key sets, space at least

log C(2^w, n) − log( n C(ε2^w + n, (1−γ)n) C(ε2^w, γn) )
≥ log (2^w/n)^n − log( n ((ε2^w + n)e/((1−γ)n))^{(1−γ)n} (ε2^w e/(γn))^{γn} )
= n log( (2^w/n) · (1−γ)n/((ε2^w + n)e) ) − γn log( ((1−γ)n/((ε2^w + n)e)) · (ε2^w e/(γn)) ) − log n
= n log( ((1−γ)/e)/(ε + n/2^w) ) − γn log( (1−γ)ε/(γ(ε + n/2^w)) ) − log n
≥ (1 − γ) n log( 1/(ε + n/2^w) ) − (H(γ) + log e) n − log n,

where H(γ) = −γ log γ − (1−γ) log(1−γ) ≤ 1 is the binary entropy function. For n > 120 the sum of the last two terms is bigger than −(5/2)n. □
In the discussion following Theorem 1 we noted that if there are quotient functions with optimal range, the space usage of our scheme is n log(1/(ε + n/2^w)) + O(n) when tables of combined size n are used. The expected fraction γ of false negatives is less than 3/10 by Theorem 1. This means that our data structure uses within O(n) bits of 10/7 times the lower bound. The experiments described in Section 3 indicate that the true factor is less than 6/5.

2.6 Using More Tables
We now briefly look at a generalization of the two-table scheme to schemes with more tables. Unfortunately the algorithm described in Section 2.3 does not seem to generalize to more than two tables. An optimal solution can again be found using maximum weight matching, but the time complexity of this solution is not attractive. Instead we can use a variant of the cuckoo scheme described by the authors in [15], greedily attempting to accommodate keys in order k_1, ..., k_n. For two tables an insertion attempt for k_i works as follows: We store k_i in cell h_1(k_i) of T_1, pushing the previous occupant, if any, away and thus making it nestless. If cell h_1(k_i) was free we are done. Otherwise we insert the nestless element in T_2, possibly pushing out another element. This continues until we either find a free cell or loop around unable to find a free cell, in which case k_i is discarded. It follows from [15] and the analysis in Section 2.3 that this algorithm finds an optimal solution, though not as efficiently as the algorithm given in Section 2.3. When using three or more tables it is not obvious in which of the tables one should attempt placing the nestless key. One heuristic that works well is to simply pick one of the two possible tables at random. It is [...] which will provably cross any large subset of the vertices with high probability.

The main drawback of using three tables is, of course, that another memory probe is needed for lookups. Furthermore, as the range of the hash functions must be smaller than when using two tables, the smallest possible range of quotient functions is larger, so more space may be needed for each cell.
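For two tables, the cuckoo-style insertion just described can be sketched as follows. Two choices are ours, not the paper's: keys are represented as (h_1, h_2) cell pairs, and on detecting a loop we discard whichever key is nestless at that point, a simplification of the text's rule of discarding k_i. The iteration budget used to detect a loop is a simple bound adequate for this sketch.

```python
def cuckoo_select(half, keys):
    """Greedy cuckoo-style accommodation (after [15]): keys are (h1, h2)
    cell pairs in nonincreasing weight order; each insertion kicks out
    previous occupants until a free cell is found or a loop is detected.
    Returns the sorted indices of the keys that end up stored."""
    t1, t2 = [None] * half, [None] * half
    for i, _ in enumerate(keys):
        nestless, in_t1 = i, True
        for _ in range(2 * (i + 1)):      # budget: enough steps to loop around
            if in_t1:
                h = keys[nestless][0]
                t1[h], nestless = nestless, t1[h]   # evict occupant of T1 cell
            else:
                h = keys[nestless][1]
                t2[h], nestless = nestless, t2[h]   # evict occupant of T2 cell
            if nestless is None:
                break                     # found a free cell
            in_t1 = not in_t1             # evicted key tries its other table
        # If nestless is not None here, that key is discarded.
    return sorted(x for x in t1 + t2 if x is not None)
```

On the example of three keys sharing one cell pair, the two heaviest are kept and the third is discarded, matching the optimal solution of the two-table algorithm.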
3 Experiments
An important performance parameter of our lossy dictionaries is the ability to store many keys with high weight. We tested this ability for lossy dictionaries using two and three tables. For comparison, we also tested the simple one-table scheme that stores in each cell the key of greatest weight hashing to it. The tests were done using truly random hash function values, obtained from a high quality collection of random bits freely available on the Internet [12]. Figure 5 shows experimentally determined values of p_{r,αr}, the probability that the key with index i = αr is stored in the dictionary, determined from 10^4 trials. For the experiments with one and two tables we used table size r = 2048, while for the experiment with three tables we used r = 1536. We also tried various other table sizes, but the graphs were almost indistinguishable from the ones shown. From Figure 5 we see the significant improvement of moving from one to more tables. As predicted, nearly all of the r/2 heaviest keys are stored when using two tables. For three tables this number increases to about 0.88r. Of the r heaviest keys, about 84% are stored when using two tables, and 95% are stored when using three tables.

Apart from asymptotically vanishing differences around the point where the curves start falling from 1, the graphs of Figure 5 seem independent of r. For two tables the observed value of p_{r,αr} for α > 1/2 is approximately 3.5/9.6^α, and for three tables it is approximately 8/33^α for α > 0.95.

The gap to the two-table lower bound of Theorem 1 can be explained by the fact that this lower bound considers only two cells of the hash tables, whereas opportunities for storing keys may appear when considering more cells.
4 Conclusion
We have introduced the concept of lossy dictionaries and presented a simple and efficient data structure implementing a lossy dictionary. Our data structure combines very efficient lookups and near-optimal space utilization, and thus seems a promising alternative to previously known data structures when a small percentage of false negatives is tolerable, such as in the examples in Section 1.2.

Though simple and efficient hash functions seem to work well in practice with our data structure, the challenge of finding such families that provably work well remains. Furthermore, the last two graphs in Figure 5 are not completely understood. The same is true for the insertion heuristic for three or more tables.

Acknowledgment. We thank Stephen Alstrup and Theis Rauhe for helpful discussions on the construction of our two table data structure.
Figure 5: The observed probability that the element with (αr)-th highest weight is stored when using one, two and three tables. For two tables our lower bound is shown. (Three panels, titled One Table, Two Tables and Three Tables, each plotting probability from 0 to 1 against α from 0 to 3.)
References
[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422–426, July 1970.

[2] Andrej Brodnik and J. Ian Munro. Membership in constant time and almost-minimum space. SIAM J. Comput., 28(5):1627–1640 (electronic), 1999.

[3] Harry Buhrman, Peter Bro Miltersen, Jaikumar Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (STOC '00), pages 449–458. ACM Press, New York, 2000.

[4] Larry Carter, Robert Floyd, John Gill, George Markowsky, and Mark Wegman. Exact and approximate membership testers. In Proceedings of the 10th Annual ACM Symposium on Theory of Computing (STOC '78), pages 59–65. ACM Press, New York, 1978.

[5] William J. Cook, William H. Cunningham, William R. Pulleyblank, and Alexander Schrijver. Combinatorial Optimization. John Wiley & Sons Inc., New York, 1998. A Wiley-Interscience Publication.

[6] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Penttonen. A reliable randomized algorithm for the closest-pair problem. Journal of Algorithms, 25(1):19–51, 1997. doi:10.1006/jagm.1997.0873.

[7] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary cache: A scalable wide-area Web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281–293, 2000.

[8] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. Assoc. Comput. Mach., 31(3):538–544, 1984.

[9] Insung Ihm and Sanghun Park. Wavelet-based 3D compression scheme for very large volume data. In Graphics Interface, pages 107–116, 1998.

[10] Stasys Jukna. Extremal Combinatorics with Applications in Computer Science. Springer-Verlag, 2000.

[11] Tae-Young Kim and Yeong Gil Shin. An efficient wavelet-based compression method for volume rendering. In Seventh Pacific Conference on Computer Graphics and Applications, pages 147–156, 1999.

[12] George Marsaglia. The Marsaglia random number CDROM including the diehard battery of tests of randomness. http://stat.fsu.edu/pub/diehard/.

[13] Rasmus Pagh. Low redundancy in static dictionaries with O(1) lookup time. In Proceedings of the 26th International Colloquium on Automata, Languages and Programming (ICALP '99), volume 1644 of Lecture Notes in Computer Science, pages 595–604. Springer-Verlag, Berlin, 1999.

[14] Rasmus Pagh. On the cell probe complexity of membership and perfect hashing. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing (STOC '01). ACM Press, New York, 2001.

[15] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. To appear in Proceedings of ESA 2001, 2001.

[16] Flemming Friche Rodler. Wavelet based 3D compression with fast random access for very large volume data. In Seventh Pacific Conference on Computer Graphics and Applications, pages 108–117, Seoul, Korea, 1999.

[17] Flemming Friche Rodler and Rasmus Pagh. Fast random access to wavelet compressed volumetric data using hashing. Manuscript.

[18] Dennis S. Severance and Guy M. Lohman. Differential files: Their application to the maintenance of large data bases. ACM Transactions on Database Systems, 1(3):256–267, September 1976.

[19] Robert Endre Tarjan. Efficiency of a good but not linear set union algorithm. J. Assoc. Comput. Mach., 22:215–225, 1975.