(1)

Looking Back, Looking Forward: New Strategies for Coverage of a National Web Sphere

By Eld Zierau, The Royal Library of Denmark, and Ditte Laursen, The State and University Library of Denmark

(2)

The Web Sphere

Danish web sphere

Rest of WWW

Inside the .dk domain

Outside the .dk domain

An increasing number of national webpages move to generic top-level domains like

(3)

Contents

• A bit of history

• The implementation

(4)

.dk and non .dk

38,000

(5)

.dk and non .dk

(6)

Top 10 biggest domains

2006: tv2.dk, dbu.dk, miniclip.com, microsoft.com, inforce.dk, radioorbit.dk, omegn.dk, karstenskj.dk, um.dk, pornaccess.com

2015: snatch.dk, cloudfront.net, pinterest.com, twitter.com, google.com, vimeocdn.com, googleusercontent.com, blogspot.com, amazonaws.com, dmi.dk

(7)

.dk and non .dk

38,000 44,000 91,751

(Zierau 2015)

(8)

WebDanica project: tested different methods

NetArchive link (NL) method

• NL data: outlinks from Danish broad crawl 2012
• Find Danish webpages
• NL results

Internet Archive (IA) method

• IA data: worldwide collection 2012 (Wide0005)
• Find Danish webpages
• IA results

Very few common results (host: first part of the URL, e.g. abc.xx in http://abc.xx/def/ghi/...)

• Only in NL: 46,552
• Only in IA: 43,185
• Both in IA and NL: 2,014

General implementation covering more methods
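The host comparison on this slide can be illustrated with a small sketch. It is not the WebDanica code: the function names and the example URLs are assumptions made for illustration. It reduces each URL to its host and uses set operations to find hosts seen by only one method or by both. The last lines use the three counts from the slide, which sum to 91,751, to show how small the overlap is.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def compare_methods(nl_urls, ia_urls):
    """Split hosts into: found only by NL, found only by IA, found by both."""
    nl_hosts = {host_of(u) for u in nl_urls} - {""}
    ia_hosts = {host_of(u) for u in ia_urls} - {""}
    return {
        "only_nl": nl_hosts - ia_hosts,
        "only_ia": ia_hosts - nl_hosts,
        "both": nl_hosts & ia_hosts,
    }


# Hypothetical example URLs, not taken from either data set.
nl_urls = ["http://abc.xx/def/ghi/", "http://shop.example.com/da/"]
ia_urls = ["http://ABC.xx/other", "http://blog.example.org/"]
result = compare_methods(nl_urls, ia_urls)

# Share of hosts found by both methods, using the counts on the slide.
only_nl, only_ia, both = 46552, 43185, 2014
share_both_percent = round(100 * both / (only_nl + only_ia + both), 1)
```

With the slide's counts, roughly 2% of all hosts found were found by both methods, which supports running several methods side by side.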

(9)

Implementation – what to do

How to implement?

• Automate whenever possible
• Support web curators' work
• Support several methods
◦ Similar to NL (based on seed)
◦ Similar to IA (based on existing extracts)
◦ Known Danish seeds outside .dk

(10)

Implementation – the seed washer

Present setup for harvest and preservation: Netarkivet

Find outside .dk (runs independently of present operation)

Input:
• Seeds to be examined
• Extracts to be examined
• Known DK seeds to be included

Output:
• Seed list for special harvest
• (Sub)domains for bulk harvests

(11)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list), e.g. from
◦ NA outlinks
◦ researchers
◦ .nu, .tv, …
◦ …
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
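A minimal sketch of the "clean" step, under stated assumptions: the slides do not say how seeds are matched against the known or banned list, so here a seed is dropped when its host equals a listed domain or is a subdomain of one, and exact duplicate seeds are dropped too. All names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def matches_domain(host: str, domains) -> bool:
    """True if host equals a listed domain or is a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def clean_seed_list(seeds, known_or_banned):
    """Return the seeds still to be processed, in their original order.

    Assumption: a seed is removed if it has no host, has been seen before,
    or belongs to a known or banned domain.
    """
    seen = set()
    to_process = []
    for seed in seeds:
        host = host_of(seed)
        if not host or seed in seen or matches_domain(host, known_or_banned):
            continue
        seen.add(seed)
        to_process.append(seed)
    return to_process


# Hypothetical input, for illustration only.
seeds = [
    "http://example.nu/a",
    "http://banned.example.com/x",
    "http://sub.banned.example.com/",
    "http://example.nu/a",
    "http://nyheder.example.tv/",
]
known_or_banned = {"banned.example.com"}
to_process = clean_seed_list(seeds, known_or_banned)
```

Matching on the host rather than the full URL keeps the known or banned list short, at the cost of treating every page on a listed domain the same way.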

(12)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts

Monitoring: special harvests, not under legal legislation, not bit preserved

(13)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined, e.g. parsed text from IA
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts

Generated in form of inputting

(14)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts
• Calculate basis for nationality determination of seeds
• Seeds with nationality related information

Some are only likely Danish

Some are NOT Danish

(15)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts
• Calculate basis for nationality determination of seeds
• Seeds with nationality related information
• Find Danish seeds
• Found Danish seeds

(16)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined
• Known DK seeds to be included, e.g. from researchers, manual search, Facebook API,
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts
• Calculate basis for nationality determination of seeds
• Seeds with nationality related information
• Find Danish seeds
• Found Danish seeds

Generated in form of inputting

(17)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined
• Known DK seeds to be included
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts
• Calculate basis for nationality determination of seeds
• Seeds with nationality related information
• Find Danish seeds
• Found Danish seeds
• Find (sub)domain candidates for bulk harvest
• List of bulk harvest (sub)domain candidates
• Find Danish (sub)domains
• Create seed list for special harvest

Manual task:

(18)

Seed washer – Seeds method

Find outside .dk:

• Seeds to be examined (seeds list)
• Extracts to be examined
• Known DK seeds to be included
• "Clean" seed list for known or banned seeds
• Known or banned seeds/domains
• List of seeds to be processed
• Harvest and generate harvest extracts
• Harvest extracts
• Calculate basis for nationality determination of seeds
• Seeds with nationality related information
• Find Danish seeds
• Found Danish seeds
• Find (sub)domain candidates for bulk harvest
• List of bulk harvest (sub)domain candidates
• Find Danish (sub)domains
• Create seed list for special harvest

Minimize time gap between harvests

(19)

Assumptions

• It is possible to minimize the harvest gap sufficiently
• There are no legal issues in harvesting outside the .dk top-level domain
• It is acceptable
◦ That we lose material: only seeds with about a 90% probability of being relevant for Danish heritage are included
◦ That we include some noise: about 10% of the included seeds are not relevant for Danish heritage
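The trade-off between lost material and noise can be made concrete with a short sketch. It assumes that each seed already carries an estimated probability of being relevant for Danish heritage. The slides do not say how such a value is computed, so the scores, the function names and the example seeds are all hypothetical.

```python
def select_seeds(scored_seeds, threshold=0.9):
    """Split (seed, probability) pairs into included and excluded lists."""
    included = [(s, p) for s, p in scored_seeds if p >= threshold]
    excluded = [(s, p) for s, p in scored_seeds if p < threshold]
    return included, excluded


def expected_noise(included):
    """Expected share of included seeds that are not relevant."""
    if not included:
        return 0.0
    return sum(1.0 - p for _, p in included) / len(included)


def expected_loss(excluded):
    """Expected number of relevant seeds among those left out."""
    return sum(p for _, p in excluded)


# Hypothetical scores, for illustration only.
scored = [
    ("example.nu", 0.95),
    ("example.tv", 0.90),
    ("example.com", 0.85),
    ("example.org", 0.40),
]
included, excluded = select_seeds(scored, threshold=0.9)
noise = expected_noise(included)   # about 0.075 for these scores
loss = expected_loss(excluded)     # about 1.25 relevant seeds left out
```

Raising the threshold lowers the expected noise but increases the expected loss, which is why both are listed as accepted consequences.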

(20)

The way forward …

• Implementation in 2016: just started

• Expect to evaluate and adjust after 2-3 years

Netarkivet

(21)

Questions

Images of this style from digitalbevaring.dk
