• Ingen resultater fundet

BRICS Basic Research in Computer Science

N/A
N/A
Info
Hent
Protected

Academic year: 2022

Del "BRICS Basic Research in Computer Science"

Copied!
17
0
0

Indlæser.... (se fuldtekst nu)

Hele teksten

(1)

B R ICS R S -01-33 P agh & R od ler: L ossy Diction aries

BRICS

Basic Research in Computer Science

Lossy Dictionaries

Rasmus Pagh

Flemming Friche Rodler

BRICS Report Series RS-01-33

ISSN 0909-0878 August 2001

(2)

Copyright c 2001, Rasmus Pagh & Flemming Friche Rodler.

BRICS, Department of Computer Science University of Aarhus. All rights reserved.

Reproduction of all or part of this work is permitted for educational or research use on condition that this copyright notice is included in any copy.

See back inner page for a list of recent BRICS Report Series publications.

Copies may be obtained by contacting:

BRICS

Department of Computer Science University of Aarhus

Ny Munkegade, building 540 DK–8000 Aarhus C

Denmark

Telephone: +45 8942 3360 Telefax: +45 8942 3255 Internet: BRICS@brics.dk

BRICS publications are in general accessible through the World Wide Web and anonymous FTP through these URLs:

http://www.brics.dk ftp://ftp.brics.dk

This document in subdirectory RS/01/33/

(3)

Rasmus Pagh

and Flemming Friche Rodler

BRICS

Department of Computer Science

University of Aarhus, Denmark

{pagh,ffr}@brics.dk

August, 2001

Abstract

Bloomlteringisanimportanttechniqueforspaceecientstorageof

a conservativeapproximationofaset

S

. Theset stored may haveupto

some speciednumberof falsepositive members, but allelements of

S

are included. In this paper we consider lossy dictionaries that are also

allowedtohavefalsenegatives,i.e., leaveoutelementsof

S

. Theaimis

to maximizethe weightof includedkeyswithin agiven spaceconstraint.

Thisrelaxationallowsaveryfastandsimpledatastructuremakingalmost

optimal useofmemory. Being moretime ecient thanBloomlters,we

believe our data structure to be well suited for replacing Bloom lters

in some applications. Also, the fact that our data structure supports

information associated tokeyspavesthewayfornewuses, asillustrated

byanapplicationin lossyimagecompression.

1 Introduction

Dictionaries are part of many algorithms and data structures. A dictionary

provides access to information indexed by a set

S

of keys: Given a key, it

returnsthe associated information or reports thatthe keyis not inthe set. In

this paper we will not be concerned with updates, i.e., we consider the static

dictionary problem. The main parameters of interest are of course the space

usedbythedictionaryandthe timeforlookingupinformation. Wewillassume

keysaswellastheinformation associated withkeys to have a xedsize.

A large literature has grown around the problem of constructing ecient

dictionaries, and theoretically satisfying solutions have been found. Often a

slightly easier problem has been considered, namely the membership problem,

which is the dictionary problem without associated information. It is usually

easy to derive a dictionary from a solution to the membership problem, using

Partially supportedbytheISTProgrammeoftheEUundercontractnumberIST-1999-

14186(ALCOM-FT).

BasicResearchinComputerScience(www.brics.dk),fundedbytheDanishNationalRe-

searchFoundation.

(4)

particularlyinterested indictionaryandmembership schemes usinglittlemem-

ory. Let

n

denote thesize of thekey set

S

. It hasbeen shownthat when keys

are

w

-bit machinewords,lookupscanbeperformedinconstanttimeinamem-

bership data structureoccupying

B + o ( B )

bitsof memory,where

B = log 2 n w

istheminimumamountofmemoryneededtobeableto representanysubsetof

size

n

[2 ](logarithmsinthispaperarebase

2

). However,constantfactorsinthe

lowerorder term and lookuptime make this and similarschemes less than one

couldhopeforfromanappliedpointofview. Also,dicultyofimplementation

is an obstacle to practical use. In total, current schemes with asymptotically

optimalspace usageappearto be mainly oftheoretical interest.

If one relaxes the requirements to themembership datastructure, allowing

it to store a slightly dierent key set than intended, new possibilities arise.

A technique nding many applications in practice is Bloom ltering [1]. This

technique allows space-ecient storage of a superset

S 0

of the key set

S

, such

that

S 0 \ S

isnomorethan an

fractionof

{ 0 , 1 } w

. For

n 2 w

,about

log(1 / )

bits perkey in

S

are necessary and sucient for this [4]. This is a signicant

savings compared to a membership data structure using

B n log( 2 w n e )

bits.

Lookup of a key using Bloom ltering requires

O (log(1 / ))

memory accesses,

andis thus relatively slowcompared to other hashingschemes when

issmall.

Also,Bloomltering diersfrommost otherhashingtechniquesinthatitdoes

not yield asolution to thedictionaryproblem.

1.1 This Paper

Inthis paper we introduce the concept of lossy dictionaries thatcan have not

only false positives (like Bloomlters), but also false negatives. That is, some

keysin

S

(withassociated information)arethrownawaywhenconstructing the dictionary. For false positivesthere isno guarantee on theassociated informa-

tionreturned. Weleteachkeyin

S

haveaweight,andtrytomaximizethesum

ofweightsof keysinthe dictionary underagiven spaceconstraint.

We study this problem on a unit cost RAM, in the case where keys are

machine words of

w

bits, examininga very simple and ecient data structure

from a theoretical as well as an experimental point of view. Experimentally,

we nd that our data structure has surprisingly good behavior with respect

to keeping the keys of largest weight. The experimental results are partially

explained by our theoretical considerations, under strong assumptions on the

hash functions involved. Specically, we assume that in our RAM model, for

a number of random functions, arbitrary function values can be returned in

constant time by an oracle. We also show that our data structure is nearly

optimalwithrespectto spaceusage.

1.2 Applications

A cache can be seen as a dictionary that stores a small subset of a large key

set, plus associated information. It is thus inherently lossy. Alossy dictionary

allowedtodiscardasmallfractionofakeysetmaythusinmanycasesbeaquite

(5)

allow no false positives. Our lossy dictionary seems best suited for aplications

where thecacheonly changesperiodically,asfor exampleinWebcaching.

Web cache sharing [7 ] is a technique for implementing cooperating caches,

for example Web proxies. When a request arrives at a proxy, it rst checks

whether it can answer therequest. If not, it can forward therequest to other

proxies in the network. However, this increases trac and is rather expensive.

Incooperativecachingeachproxykeepsasummaryofthecontentofallrelevant

proxies available to it. To reduce space requirements, this summary is stored

witha smallfractionoferror usingBloomltering. Often thisreduces network

tracdramatically,sincethereisnomorethanasmallchancethatanexpensive

requestforwardingisperformedinvain. Lossydictionarieswithtwo-sidederror

could be used as a summaryrather than a Bloom lter, since a small fraction

offalse negatives(cache misses)istolerable.

Infact,thegeneralideaofusingin-memorysummariestoreducethenumber

ofexpensiveoperations,suchasI/O's,iswellknowninthedatabasecommunity.

Itdatesatleastbackto[18],whichusesBloomlteringforecientmanagement

ofdierent versionsof databases.

Recently, interestin lossy(volume)data compression withfastrandom ac-

cesstodecodeddatahasarisen[9,11,16,17]. In[17]weshowthatlossydictio-

nariesarewellsuitedfor thispurpose,providinglossystorageofthecoecients

of wavelet transformed data. Compared to the previously bestmethods in the

literature[9,16 ],ourlossydictionarybasedschemetypicallyperforms80%bet-

terintermsofcompressionratio,whilesignicantlyreducingtherandomaccess

time.

1.3 Related Work

Mostprevious workonstaticdictionaries hasconsidered themembershipprob-

lemonaunitcostRAMwithword size

w

. Therstmembership datastructure

withworstcaseconstantlookuptimeusing

O(n)

wordsofspacewasconstructed by Fredman et al. [8 ]. For constant

δ > 0

, the space usage is

O ( B )

when

2 w > n 1+δ

,butingeneralthedata structuremayuse

Ω( Bw )

bitsofspace. The

spaceusagehasbeen lowered to

B + o(B)

bitsbyBrodnikand Munro[2]. The

lowerorderterm wassubsequentlyimproved to

o ( n ) + O (log w )

bits bytherst

author [13 ]. The main concept used in the latter paper is that of a quotient

function

q

ofahashfunction

h

,denedto be afunctionsuchthatthemapping

k 7→ ( h ( k ) , q ( k ))

is injective.

ThemembershipproblemwithfalsepositiveswasrstconsideredbyBloom[1 ].

He described atechnique,now knownasBloom ltering, where lookups return

the conjunction of a number of bits from a bit vector. The locations of the

bit probes are the values of a series of hash functions on the element to be

looked up. Apartfrom Bloomltering thepaperpresents aless spaceecient

datastructure thatisreadily turnedintoa lossydictionarywithonlyfalse pos-

itives. However, thespace usage ofthe derived lossydictionary isnot optimal.

Carteretal.[4 ]providedalowerboundof

n log(1 / )

bitsonthespaceneededto

solve membership withan

fraction falsepositives, for

n 2 w

,and gave data

(6)

Though none of their membership data structures have constant lookup time,

such adata structurefollows byplugging the abovementioned results on space

optimal membership data structures [2, 13 ] into a general reduction provided

in [4]. In fact, the dictionary of [13 ] can be easily modied to a lossy dictio-

nary with false positives, thus also supporting associated information, using

O ( n + log w )

bits more than thelowerbound.

Another relaxation of the membership problem was recently considered by

Buhrmanetal.[3]. Theystoretheset

S

exactly,butallowthelookupprocedure

to use randomization and to have some probability of error. For two-sided

error

they show that there existsa data structure of

O ( nw/ 2 )

bits inwhich

lookups can be done using just one bit probe. It was shown that to do the

samewithoutfalsenegatives,

O ( n 2 w/ 2 )

bitssuce,andthatthisisessentially optimal. Schemes using more bit probes and lessspace were also investigated.

If one xes the random bits of the lookup procedure appropriately, the result

is a lossy dictionary with error

. However, it is not clear how to eciently

guaranteethe

fractionoffalsepositivesinareasonablemodelofcomputation, sothisdoesnot immediately give rise to alossy dictionary.

2 Lossy Dictionaries

Consideraset

S

containingkeys

k 1 , . . . , k n

withassociatedinformation

a 1 , . . . , a n

and positive weights

v 1 , . . . , v n

. Suppose we are given an upper bound

m

on

available space and an error parameter

> 0

. The lossy dictionary problem

for

= 0

is to store a subset of the keys in

S

and corresponding associated informationinadatastructureof

m

bits,tryingtooptimizethesumofweights

ofincluded keys. For general

we alsoallowthedictionarytocontain

2 w

keys

fromthe complement of

S

. Inthis section weshowthefollowing theorem.

Theorem 1 Letasequenceofkeys

k 1 , . . . , k n ∈ { 0 , 1 } w

,associated information

a 1 , . . . , a n ∈ { 0 , 1 } l

, and weights

v 1 ≥ · · · ≥ v n > 0

be given. Let

r > 0

be an

even integer, and

b 0

an integer. Suppose we have oracle access to random

functions

h 1 , h 2 : { 0 , 1 } w → { 1 , . . . , r/ 2 }

and corresponding quotient functions

q 1 , q 2 : {0, 1} w → {0, 1} s \ 0 s

. There is a lossy dictionary with the following

properties:

1. The space usage is

r(s b + l)

bits (two tables with

r/2

cells of

s b + l

bits).

2. The fractionof false positives isbounded by

(2 b 1) r/ 2 w

.

3. The expected weight of the keys in the set stored is

P n

i=1 p r,i v i

where

p r,i

( 1 52 r −1 / ( r i 2) ,

for

i < r/ 2 2 (1 2 /r ) i−1 (1 2 /r ) 2(i−1) ,

for

i r/ 2

istheprobability that

k i

isincludedintheset (whichisindependentof

v i

).

(7)

5. The construction time is

O ( n log n + rl/w )

.

As discussed in Section 2.1 there exist quotient functions for

s = w log( r/ 2) + O (1)

ifthe hash functions map approximately the same number of elementsto each value in

{ 1 , . . . , r/ 2 }

. Theinequality initem 2is satised for

b = b log(2 w /r + 1) c

,sofor

s = w log r + O (1)

an

fractionoffalse positives

can be achieved using space

r (log( +r/2 1 w ) + l + O (1))

. As can be seen from

item3,almost all of the keys

{k 1 , . . . , k r/2 }

areexpectedto be included in the

setrepresented bythe lossy dictionary. For

i r/ 2

our bound on

p i,r

isshown

in Figure 5 of Section 3, together with experimentally observed probabilities.

If

n r

and if

r

is large enough, it can be shown by integration that, in the expectedsense,morethan

70%

ofthekeysfrom

{k 1 , . . . , k r }

areincludedinthe

set(ourexperimentsindicate

84%

). WeshowinSection2.5thattheamountof

spacethatwe useto achieve thisis within asmall constant factor ofoptimal.

Note that by setting

b = 0

we obtain a lossy dictionary with no false pos-

itives. Another point is that given a desired maximum space usage

m

and

false positive fraction

, the largest possible size

r

of the tables can be usu-

allybe choseneciently. Assume, forexample, thatwe have quotient function

with range

d log(2 w+1 /r ) e

and consider the case

b = 0

. The memory usage is

r ( d log(2 w+1 /r ) e + l )

. Whenever

r

is doubled, the number of bits per cell be-

comesone less. Thismeans thatthememory usageincreases piecewiselinearly

in

r

,withjumpswhen

r

is a powerof two. By setting

r = 2 i

for

i = 1 , 2 , 3 , . . .

we nd the

i

for which

m

is rst exceeded. The correct value of

r

can now

easily be found in the interval

2 i−1 < r < 2 i

. For general

b

this becomes

morecomplicated,asweneedto investigatemoreintervals,but nding

r

isstill

implementable in

O (log m )

time.

2.1 Preliminaries

The starting point for the design of our data structure is a static dictionary

recently described in [14 ]. In this dictionary, two hash tables

T 1

and

T 2

are

used together with two hash functions

h 1 , h 2 : {0, 1} w → {1, . . . , r/2}

, where

r

denotes the combined size of the hash tables, assumed to be even. A key

x S

is stored ineithercell

h 1 ( x )

of

T 1

or cell

h 2 ( x )

of

T 2

. Itwas shownthat

if

r (2 + δ) n

, for constant

δ > 0

, and

h 1 , h 2

are random functions, there

existsa wayof arrangingthekeys inthetablesaccordingto thehashfunctions

with probability at least

1 52 δr

. For small

δ

this gives a dictionary utilizing

about 50% of the hash table cells. The arrangement of keys was shown to be

computableinexpectedlineartime.

Another central concept is that of quotient functions. Recall that a quo-

tient function

q

of a hash function

h

is a function such that the mapping

k 7→ ( h ( k ) , q ( k ))

is injective [13 ]. When storing a key

k

in cell

h ( k )

of a hash

table,itissucienttostore

q ( k )

touniquelyidentify

k

amongallotherelements

hashing to

h ( k )

. To mark empty cells, one needs a bit string not mapped to

bythe quotient function, e.g.

0 s

for thequotient functions of Theorem 1. The

(8)

q 1 (k 4 )

q 1 (k 2 )

q 2 (k 3 ) q 2 (k 5 ) q 1 (k 1 )

h 1 (k 2 ) h 1 (k 4 ) h 1 (k 1 )

h 2 (k 3 ) h 2 (k 5 )

Figure1: Exampleofourdatastructure.

ideaof usingquotientfunctionsisthatstoring

q ( k )

mayrequirefewerbits than

storing

k

itself. Ifafraction

O (1 /r )

ofallpossiblekeyshashes toeachof

r

hash

table cells, there is a quotient function whose function values can be stored in

w log r + O (1)

bits. Thisapproach wasusedin[13 ] toconstruct adictionary

usingspace close tothe information theoreticalminimum.

Example We consider the hash function family from [6 ] mapping from

{ 0 , 1 } w

to

{ 0 , 1 } t

,i.e.,with

r = 2 t

. Itcontainsfunctionsoftheform

h a ( k ) = ( ak mod 2 w )

div

2 w−t

for

a

odd and

0 < a < 2 w

. A corresponding family of quo- tient functions is given by

q a ( k ) = ( ak mod 2 w )

mod

2 w−t

, whose function

values can be stored in

w log r

bits.

2.2 Our Data Structure

Theidea behindour lossy dictionary, compared to the static dictionaryof [14 ]

describedabove,istotrytollthehashtablesalmostcompletely,workingwith

keysetsof sizesimilarto or largerthan

r

. Each keyhastwohashtable cellsto

which itcan bematched.

Thus, given a pair of hash functions, the problem of nding a maximum

weight subset of

S

that can be arranged into the hash tables is a maximum

weight matching problem that can be solved in polynomial time, see e.g. [5 ].

InSection 2.3we will present analgorithm thatnds such anoptimal solution

intime

O ( n log n )

. Theterm

O ( rl/w )

inthe timeboundof Theorem 1 is the

timeneededto copyassociated information tothetables. Assumefor nowthat

we knowwhich keysareto be representedinwhichhash tablecells.

For

b = 0

we simply store quotient function values innonempty hashtable

cellsand the value

0 s

inemptyhashtable cells, using

s

bits percell, asshown

inFigure 1. For general,

b

we store only therst

s b

bits. Observe that no

more than

2 b

keys with the same hashfunction valuecan share the rst

s b

bits of thequotient function value. This means that there are at most

2 b 1

false positives for each nonempty cell. Since

0 s

is not inthe range, this is also

trueforemptycells. Inadditiontothe

s b

bits,weuse

l

bits percell to store

associated information.

Wenowproceedtollintheremainingdetailsonitems3and5ofTheorem1.

(9)

Recall that the task of constructing our data structure boils down to nding

the largest weight arrangement of keys in the tables. Given hash functions

h 1

and

h 2

and a key set

K

, we dene the bipartite graph

G ( K )

with vertex

set

{ 1 , 2 } × { 1 , . . . , r/ 2 }

, corresponding in a natural way to hash table cells, and the multiset of edges

{{ (1 , h 1 ( k )) , (2 , h 2 ( k )) } | k K}

, corresponding to keys. Note thatthere may be parallel edgesif several keys have thesame pair

of hash function values. We will use the terms keys/edges and cells/vertices

synonymously. A connected component of

G ( K )

is dened to be saturated if

the number of edges is greater than or equal to the number of vertices, i.e., if

itisnot atree. Wehave thefollowing characterization of thekeysets thatcan

be placedinthe tablesaccording to the given hashfunctions.

Lemma 1 Thekeyset

K

canbeplacedinthetablesifandonlyifeachconnected

component of

G ( K )

is a tree, pluspossibly an extra edge.

Proof. By Hall's theorem,

K

can be placed in the tables if and only if every

subset

K 0 K

satises

|h 1 ( K 0 ) | + |h 2 ( K 0 ) | ≥ |K 0 |

. This is true if and only

if every subset

K 0

of edges in

G ( K )

covers at least

|K 0 |

vertices. Since it is

equivalenttoquantifyonlyoversubsetsofedgeswithinaconnectedcomponent,

thelemmafollows.

2

By an optimal solution for a key set

K

we will understand a maximum

weight subset of

K

thatcan beplaced inthe tables.

Lemma 2 There isan optimalsolutionfor

{k 1 , . . . , k i }

includingkey

k i

ifand

onlyif for any optimal solution

K 0

for

{k 1 , . . . , k i−1 }

, the set

K 0 ∪ {k i }

can be

placed in the tables.

Proof. If

K 0 ∪ {k i }

can be placedinthetablesfor somesolution

K 0

optimalfor

{k 1 , . . . , k i−1 }

,then

K 0 ∪ {k i }

mustbe optimal for

{k 1 , . . . , k i }

.

On the other hand,suppose that for some

K ⊆ {k 1 , . . . , k i−1 }

, thekey set

K ∪ {k i }

can be placed in the tables and has optimal weight, and let

K 0

be

an optimal solution for

{k 1 , . . . , k i−1 }

. Consider the connected components of

(1 , h 1 ( k i ))

and

(2 , h 2 ( k i ))

in

G ( K )

. ByLemma1andsince

K∪{k i }

canbeplaced

inthetables,atleastoneofthe(possiblyidentical)connectedcomponentsmust

beatree,withoutlossofgeneralitythecomponentof

(1 , h 1 ( k i ))

. Since

K ∪{k i }

is optimal, the connected component of

(1 , h 1 ( k i ))

in

G ( {k 1 , . . . , k i−1 } )

must

alsobea tree. (Iftherewasacycle, akeyofhigherweight couldbesubstituted

for

k i

,contradicting the optimality of

K ∪ {k i }

.) In particular, the connected componentof

(1 , h 1 ( k i ))

in

G ( K 0 )

isatree. Thus,byLemma1theset

K 0 ∪{k i }

canbe placedinthetables.

2

Thelemmaimpliesthatthefollowing greedyalgorithmndsanoptimalkey

set

S 0

given keys sorted accordingto nonincreasing weight.

1. Initialize aunion-nd datastructure for the cellsof thehashtables.

2. For eachequivalence class, seta saturated ag to false.

(10)

3. For

i = 1, . . . , n

:

(a) Find the equivalence classes

c 1

of cell

h 1 ( k i )

in

T 1

, and

c 2

of cell

h 2 (k i )

in

T 2

.

(b) If

c 1

or

c 2

isnot saturated:

i. Include

k i

inthe solution.

ii. Join

c 1

and

c 2

to form anequivalence class

c

.

iii. Set the saturated ag of

c

if

c 1 = c 2

, or if the saturated ag is

setfor

c 1

or

c 2

.

x i

Figure2: The case where

c 1 = c 2

and the component is nonsaturated. The component becomessaturated.

k i

Figure3: Thecase withonesaturatedandonenonsaturatedcomponent. Thenewcompo-

nentbecomessaturated.

k i

Figure 4: Thecasewithtwosaturatedcomponents. Thenewelementisnotincluded.

In the loop, equivalence classes correspond to the connected components

of the graph

G ( {k 1 , . . . , k i−1 } )

. There is a simple implementation of a union- nddatastructure forwhichoperationstake

O (log n )

amortized time;see[19 ]

which actually gives an even better time bound. Figures 2 to 4 show three

possible cases instep3b ofthealgorithm.

What remainsis arrangingtheoptimal keyset

S 0

in thetables. Consider a

vertexin

G ( S 0 )

ofdegreeone. Itisclearthattheremustbeanarrangementsuch thatthecorrespondingcell contains thekeyoftheincidentedge. Thus,onecan

(11)

delete them. As we remove the same number of edges and vertices from each

connected component, the remaining graph consists of connected components

with no more edges than vertices and no vertices of degree one, i.e., cycles.

The arrangement of edges in a cycle follows as soon as one key has been put

(arbitrarily) into one of the tables. The above steps are easily implemented to

runinlinear time. Thisestablishes item5 ofTheorem 1.

2.4 Quality of Solution

Wenowturntotheproblemofestimatingthequalityofthesolution. Notethat

the optimal key set returned by our algorithm does not depend on the actual

weights, but only on thesequenceof hash function values. Thus,the expected

weight of our optimalsolution is

P n

i=1 p r,i v i

, where

p r,i

is theprobabilitythat the

i

thkeyisincludedinthereturnedoptimalsetofkeys,whichisindependent ofthe weights.

Ouralgorithmincludesallkeys

{k 1 , . . . , k i }

intheoptimalsolutionreturned

if they can all be accommodated under the given hash functions. Using the

resultof [14]mentionedinSection 2.1on

{k 1 , . . . , k i }

with

δ = r/i 2

,wehave

thatfor

i < r/ 2

this happenswithprobabilityat least

1 52 r −1 / ( r/i 2)

. In

particular,

p r,i

isat leastthis big.

For

i r/ 2

we derive a lower bound on

p r,i

as follows. If one of the

vertices

(1 , h 1 ( k i ))

and

(2 , h 2 ( k i ))

in

G ( {k 1 , . . . , k i−1 } )

is isolated, then

k i

is

in the optimal solution returned. The randomness assumption on our hash

functionsimpliesthat

G ( {k 1 , . . . , k i−1 } )

has

i 1

randomly andindependently chosen edges. Thus, we have the bound

p r,i 1 (1 (1 2 /r ) i−1 )) 2 = 2(1 2 /r ) i−1 (1 2 /r ) 2(i−1) 2 e −i/r e −2i/r

. This establishes item 3 of Theorem1 and concludesthe proof.

2.5 A Lower Bound

Thissection gives a lower bound on the amount of memory needed by a lossy

dictionarywithan

fractionoffalsepositivesand

γn

falsenegatives. Ourproof

technique issimilar to thatusedfor thelowerboundin[4 ] for thecase

γ = 0

.

Proposition 1 For

0 < < 1 / 2

and

0 < γ < 1

, alossydictionaryrepresenting

a set

S ⊆ {0, 1} w

of

n

keys,

120 < n 2 w−1

, with at most

2 w

false positives

andat most

γn

falsenegatives must usespace of atleast

(1 γ ) n log

1 + n/ 2 w

5 2 n

bits.

Proof. We can assume without loss of generality that

γn

is integer (this only

givesastrongerspacelowerbound),andthat

2 w

isinteger. Considerthesetof

alldatastructuresusedfor thevarious subsetsof

n

elementsfrom

{0, 1} w

. Any

ofthese data structuresmust represent a setof at most

2 w + n

keys, inorder

tomeet the requirement on thenumberof falsepositives. Thus, thenumberof

n

-elementsets havingupto

γn

keysoutsidethesetrepresentedbyagiven data

(12)

structure is at most

P γn

i=0 2 w +n n−i

2 w

i

. Since

< 1/2

and

n 2 w−1

we have

2 w + n 2 w

, and sothe largest term inthe summation is

2 n−γn w +n 2 w

γn

. Thus

we have the upperbound

n 2 n−γn w +n 2 w

γn

.

Wewillusetheinequalities

( a b ) b a b

< ( ae b ) b

,seee.g.[10,Proposition1.3].

Bytheupperboundonthenumberofsetsrepresentablebyeachdatastructure,

we need,inorder torepresent all

2 w n

keysets, spaceat least

log 2 w

n

log

n

2 w + n (1 γ ) n

2 w γn

log 2 w

n n

log n

(2 w + n ) e (1 γ ) n

(1−γ)n 2 w e

γn γn !

= n log 2 w

n

(1 γ ) n (2 w + n ) e

γn log

(1 γ ) n (2 w + n ) e

2 w e γn

log n

= n log

(1 γ ) /e + n/ 2 w

γn log

1 γ γ ( + n/ 2 w )

log n

= (1 γ ) n log

1 + n/ 2 w

(H(γ) + log e) n log n

where

H ( γ ) = −γ log γ (1 γ ) log(1 γ ) 1

isthebinaryentropyfunction.

For

n > 120

the sum ofthelast two termsis biggerthan

5 2 n

.

2

In the discussion following Theorem 1 we noted that if there are quotient

functionswithoptimalrange, thespace usageof our schemeis

n log( +n/2 1 w ) + O(n)

when tablesof combinedsize

n

areused. The expectedfraction

γ

offalse

negatives islessthan

3 / 10

byTheorem 1. This means thatour datastructure

useswithin

O ( n )

bitsof

10 / 7

timesthelowerbound. Theexperimentsdescribed inSection 3indicate thatthe truefactor islessthan

6/5

.

2.6 Using More Tables

Wenowbrieylookatageneralizationofthe two-tableschemetoschemes with

moretables. UnfortunatelythealgorithmdescribedinSection2.3doesnotseem

to generalize tomore than two tables. Anoptimal solutioncan again be found

usingmaximumweightmatching,butthetimecomplexityofthissolutionisnot

attractive. Instead we canuse avariant of thecuckoo scheme described by the

authorsin[15 ], greedily attempting to accommodatekeys inorder

k 1 , . . . , k n

.

For two tablesan insertion attempt for

k i

worksas follows: We store

k i

in

cell

h 1 ( k i )

of

T 1

,pushing the previous occupant, ifany,awayand thus making

itnestless. If cell

h 1 ( k i )

wasfreewe aredone. Otherwise weinsertthe nestless

element in

T 2

, possibly pushing out another element. This continues until we

either nd a free cell or loop around unable to nd a free cell, in which case

k i

is discarded. It follows from [15 ] and the analysis in Section 2.3 that this

algorithm nds an optimalsolution, though, not aseciently asthealgorithm

giveninSection2.2. Whenusingthreeormoretablesitisnotobviousinwhich

of thetablesone should attempt placing the nestless key. One heuristic that

works well is to simply pick one of the two possible tables at random. It is

(13)

which willprovably crossanylarge subset oftheverticeswithhighprobability.

Themaindrawbackofusingthreetablesis,ofcourse,thatanothermemory

probe is needed for lookups. Furthermore, as the range of the hashfunctions

must be smaller than when using two tables, the smallest possible range of

quotient functionsislarger, somore spacemaybeneeded for each cell.

3 Experiments

An important performance parameter of our lossydictionaries isthe ability to

store many keys withhigh weight. We tested this ability for lossy dictionaries

usingtwo andthreetables. For comparison,wealsotestedthesimpleone-table

scheme that stores in each cell the key of greatest weight hashing to it. The

testsweredone using trulyrandom hashfunction values, obtainedfrom ahigh

quality collectionof random bits freely available on theInternet [12]. Figure 5

shows experimentally determined values of

p r,αr

, the probability that the key withindex

i = αr

is stored in thedictionary, determined from

10 4

trials. For

the experiments with one and two tables we used table size

r = 2048

while

for the experiment with three tableswe used

r = 1536

. We also tried various

other table sizes, but the graphs were almost indistinguishable from the ones

shown. From Figure 5 we see thesignicant improvement ofmoving from one

tomoretables. Aspredicted,nearlyallofthe

r/ 2

heaviestkeysarestoredwhen

usingtwo tables. Forthreetablesthisnumberincreasestoabout

. 88 r

. Ofthe

r

heaviest keys,about

84%

arestoredwhen usingtwotables, and

95%

arestored

whenusing threetables.

Apartfromasymptoticallyvanishingdierencesaroundthepointwhere the

curves start falling from

1

, the graphs of Figure 5 seem independent of

r

. For

twotablestheobservedvalueof

p r,αr

for

α > 1 / 2

isapproximately

3 . 5 / 9 . 6 α

and

forthree tablesitis approximately

8 / 33 α

for

α > 0 . 95

.

Thegaptothetwo-tablelowerboundofTheorem1canbeexplainedbythe

factthat this lower bound considers only two cells of thehash tables, whereas

opportunitiesfor storingkeys mayappear whenconsidering more cells.

4 Conclusion

Wehaveintroducedtheconceptoflossydictionariesandpresentedasimpleand

ecient data structure implementing a lossy dictionary. Our data structure

combines very ecient lookups and near-optimal space utilization, and thus

seemsapromisingalternativetopreviouslyknowndatastructureswhenasmall

percentage offalse negativesistolerable, suchastheexamplesinSection 1.2.

Though simple and ecient hash functions seem to work well in practice

withourdatastructure,thechallengeofndingsuchfamiliesthatprovablywork

well remains. Furthermore, thelast two graphs in Figure5 are not completely

understood. Thesameistruefortheinsertionheuristicforthreeormoretables.

Acknowledgment We thank Stephen Alstrup and Theis Rauhe for helpful

discussionson the construction of our two table data structure.

(14)

0 0.5 1 1.5 2 2.5 3 0

0.2 0.4 0.6 0.8 1

α One Table

Probability

0 0.5 1 1.5 2 2.5 3

0 0.2 0.4 0.6 0.8 1

Two Tables

Probability

α

0 0.5 1 1.5 2 2.5 3

0 0.2 0.4 0.6 0.8 1

Three Tables

Probability

α

Figure5: Theobservedprobability that the element with

(αr)

th highest weight is stored

whenusingone,twoandthreetables. Fortwotablesourlowerboundisshown.

References

[1] Burton H. Bloom. Space/time trade-os in hash coding with allowable

errors. Communicationsof the ACM,13(7):422426,July 1970.

[2] Andrej Brodnik and J. Ian Munro. Membership in constant time and

almost-minimum space. SIAM J. Comput., 28(5):16271640 (electronic),

1999.

[3] Harry Buhrman, Peter Bro Miltersen, Jaikumar Radhakrishnan, and

S.Venkatesh. Arebitvectors optimal? In Proceedings of the 32nd Annual

ACM Symposium on Theory of Computing (STOC '00), pages 449458.

ACMPress,New York,2000.

[4] LarryCarter,RobertFloyd,JohnGill,GeorgeMarkowsky,andMarkWeg-

man. Exact and approximate membership testers. In Proceedings of the

10thAnnualACMSymposiumon TheoryofComputing (STOC'78),pages

5965. ACM Press,New York,1978.

(15)

AlexanderSchrijver. Combinatorialoptimization. JohnWiley&SonsInc.,

New York, 1998. AWiley-IntersciencePublication.

[6] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti

Penttonen. A reliable randomized algorithm for the closest-pair problem.

Journal of Algorithms,25(1):1951, 1997. doi:10.1006/jagm.1997.0873.

[7] LiFan,PeiCao, JussaraAlmeida,andAndreiZ. Broder. Summarycache:

Ascalablewide-areawebcachesharingprotocol. IEEE/ACMTransactions

on Networking, 8(3):281293, 2000.

[8] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a

sparse table with

O (1)

worst case access time. J. Assoc. Comput. Mach.,

31(3):538544, 1984.

[9] InsungIhmand SanghunPark. Wavelet-based 3Dcompressionscheme for

very large volume data. GraphicsInterface, pages107116, 1998.

[10] StasysJukna.Extremal CombinatoricswithApplications inComputerSci-

ence. Springer-Verlag, 2000.

[11] Tae-YoungKimandYeongGilShin.Anecientwavelet-basedcompression

methodfor volumerendering. InSeventh PacicConference on Computer

Graphicsand Applications, pages147156, 1999.

[12] George Marsaglia. TheMarsaglia random number CDROM including the

diehard batteryoftests ofrandomness. http://stat.fsu.edu/pub/diehard/.

[13] Rasmus Pagh. Low Redundancy inStatic Dictionarieswith

O (1)

Lookup

Time. In Proceedings of the 26th International Colloquium on Automata,

Languages and Programming (ICALP '99), volume 1644 of Lecture Notes

in Computer Science, pages595604. Springer-Verlag,Berlin, 1999.

[14] Rasmus Pagh. OntheCell Probe Complexity ofMembership and Perfect

Hashing. In Proceedings of the 33rd Annual ACM Symposium on Theory

of Computing (STOC'01). ACMPress, NewYork,2001.

[15] RasmusPaghandFlemmingFricheRodler. Cuckoo hashing. Toappearin

Proceedingsof ESA 2001,2001.

[16] Flemming Friche Rodler. Wavelet based 3D compression with fast ran-

dom access for very large volume data. In Seventh Pacic Conference on

Computer Graphics and Applications, pages108117, Seoul, Korea, 1999.

[17] FlemmingFricheRodlerandRasmusPagh. Fastrandomaccessto wavelet

compressed volumetric datausinghashing. Manuscript.

[18] Dennis S. Severance and Guy M. Lohman. Dierential les: Their ap-

plication to the maintenance of large data bases. ACM Transactions on

Database Systems, 1(3):256267, September1976.

(16)

rithm. J. Assoc. Comput.Mach., 22:215225, 1975.

(17)

Recent BRICS Report Series Publications

RS-01-33 Rasmus Pagh and Flemming Friche Rodler. Lossy Dictionar- ies. August 2001. 14 pp. Short version appears in Meyer auf der Heide, editor, 9th Annual European Symposiumon on Al- gorithms, ESA ’01 Proceedings, LNCS 2161, 2001, pages 300–

311.

RS-01-32 Rasmus Pagh and Flemming Friche Rodler. Cuckoo Hash- ing. August 2001. 21 pp. Short version appears in Meyer auf der Heide, editor, 9th Annual European Symposiumon on Al- gorithms, ESA ’01 Proceedings, LNCS 2161, 2001, pages 121–

133.

RS-01-31 Olivier Danvy and Lasse R. Nielsen. Syntactic Theories in Prac- tice. July 2001. 37 pp. Extended version of an article to appear in the informal proceedings of the Second International Work- shop on Rule-Based Programming, RULE 2001 (Firenze, Italy, September 4, 2001).

RS-01-30 Lasse R. Nielsen. A Selective CPS Transformation. July 2001.

24 pp. To appear in Brookes and Mislove, editors, 27th Annual Conference on the Mathematical Foundations of Programming Semantics, MFPS ’01 Proceedings, ENTCS 45, 2000. A prelim- inary version appeared in Brookes and Mislove, editors, Pre- liminary Proceedings of the 17th Annual Conference on Math- ematical Foundations of Programming Semantics, MFPS ’01, (Aarhus, Denmark, May 24–27, 2001), BRICS Notes Series NS-01-2, 2001, pages 201–222.

RS-01-29 Olivier Danvy, Bernd Grobauer, and Morten Rhiger. A Unify- ing Approach to Goal-Directed Evaluation. July 2001. 23 pp.

To appear in New Generation Computing, 20(1), November 2001. A preliminary version appeared in Taha, editor, 2nd International Workshop on Semantics, Applications, and Im- plementation of Program Generation, SAIG ’01 Proceedings, LNCS 2196, 2001, pages 108–125.

RS-01-28 Luca Aceto, Zolt´an ´ Esik, and Anna Ing´olfsd´ottir. A Fully

Equational Proof of Parikh’s Theorem. June 2001.

Referencer

RELATEREDE DOKUMENTER

maripaludis Mic1c10, ToF-SIMS and EDS images indicated that in the column incubated coupon the corrosion layer does not contain carbon (Figs. 6B and 9 B) whereas the corrosion

We found large effects on the mental health of student teachers in terms of stress reduction, reduction of symptoms of anxiety and depression, and improvement in well-being

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of

In order to verify the production of viable larvae, small-scale facilities were built to test their viability and also to examine which conditions were optimal for larval

H2: Respondenter, der i høj grad har været udsat for følelsesmæssige krav, vold og trusler, vil i højere grad udvikle kynisme rettet mod borgerne.. De undersøgte sammenhænge

During the 1970s, Danish mass media recurrently portrayed mass housing estates as signifiers of social problems in the otherwise increasingl affluent anish

The Healthy Home project explored how technology may increase collaboration between patients in their homes and the network of healthcare professionals at a hospital, and

Most specific to our sample, in 2006, there were about 40% of long-term individuals who after the termination of the subsidised contract in small firms were employed on