Citations in Wikipedia

(1)

Finn Årup Nielsen

DTU Compute, Technical University of Denmark

May 8, 2017

(2)

What is “the most important ingredient of open knowledge”?

(3)

Wikicite

Arguably the most important ingredient of open knowledge, sources and references have ironically received little technical attention in the Wikimedia movement up until now. Before Wikidata, attempts failed to address the issue of representing citations and source metadata in a well-structured, machine-readable format due to both the lack of mature technology and sufficiently well-organized community efforts.

— (Taraborelli et al., 2016)

(5)

Defining moment for Wikipedia

John Seigenthaler: “This is a highly personal story about Internet character assassination. It could be your story.” (Seigenthaler, 2005)

A 2005 story about fake news.

Post-Seigenthaler Wikipedia:

“Wikipedia articles should be based on reliable, published sources ...”, see Wikipedia:Identifying reliable sources

Wikipedia:Biographies of living persons: “Be very firm about the use of

(6)

Citations in Wikipedia

In the 1990s it has been speculated that the polymorphism might be related to [[affective disorder]]s, and an initial study found such a link.<ref>{{cite journal | vauthors = Collier DA, Stöber G, Li T, Heils A, Catalano M, Di Bella D, Arranz MJ, Murray RM, Vallada HP, Bengel D, Müller CR, Roberts GW, Smeraldi E, Kirov G, Sham P, Lesch KP | title = A novel functional polymorphism within the promoter of the serotonin transporter gene: possible role in susceptibility to affective disorders | journal = Molecular Psychiatry | volume = 1 | issue = 6 | pages = 453–60 | date = December 1996 | pmid = 9154246 | author15-link = K. P. Lesch }}

Comment: {{cite journal | vauthors = Craddock N, Owen MJ | title = Candidate gene association studies in psychiatric genetics: a SERTain future? | journal = Molecular Psychiatry | volume = 1 | issue = 6 | pages = 434–6 | date = December 1996 | pmid = 9154242 }}</ref>

(9)

Citations in Wikipedia

Citations in Wikipedia can be “structured” with MediaWiki templates, such as cite journal.

And they can be extracted (Nielsen, 2007).

But with some difficulty:

if (/\|\s*journal\s*=\s*\[\[[^\|\]]+\|([^\|]*?)\]\]([^\|]*?)(?:\||\}\})/i) { $journal = $1 . $2; }

elsif (/\|\s*journal\s*=\s*\[\[([^\]]*?)\]\]([^|]*?)(?:\||\}\})/i) { $journal = $1 . $2; }

elsif (/\|\s*journal\s*=\s*\[\S+(.*?)\](.*?)(?:\||\}\})/i) { $journal = $1 . $2; }
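The three patterns above can be tried in order: piped wiki link, plain wiki link, external link. Below is a minimal Python sketch of the same idea, not the original extraction code. The function name `extract_journal` and the fourth, plain-text fallback pattern are assumptions added for illustration; the slide only shows the three bracketed cases.

```python
import re

# Shared pieces of every pattern: the "| journal =" prefix and the
# terminator (next parameter "|" or end of template "}}").
_PREFIX = r"\|\s*journal\s*=\s*"
_TAIL = r"(?:\||\}\})"

JOURNAL_PATTERNS = [
    # [[Target|Label]] plus optional trailing text: keep Label + trailing text
    re.compile(_PREFIX + r"\[\[[^|\]]+\|([^|]*?)\]\]([^|]*?)" + _TAIL, re.I),
    # [[Target]] plus optional trailing text
    re.compile(_PREFIX + r"\[\[([^\]]*?)\]\]([^|]*?)" + _TAIL, re.I),
    # [http://... Label] external link
    re.compile(_PREFIX + r"\[\S+(.*?)\](.*?)" + _TAIL, re.I),
    # Assumed fallback (not on the slide): plain, unlinked journal name
    re.compile(_PREFIX + r"([^|}]*?)()\s*" + _TAIL, re.I),
]


def extract_journal(template):
    """Return the journal name from a cite journal template, or None."""
    for pattern in JOURNAL_PATTERNS:
        match = pattern.search(template)
        if match:
            return (match.group(1) + match.group(2)).strip()
    return None


if __name__ == "__main__":
    print(extract_journal(
        "{{cite journal | journal = [[Nature (journal)|Nature]] | volume = 1 }}"))
```

The order matters: the piped-link pattern has to be tried before the plain-link pattern, otherwise the link target rather than the label would be returned.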

(10)

Citations in Wikipedia

And we are not finished!

Building a database of journal name variations: http://neuro.compute.dtu.dk/software/brede/code/brede/xml/wojous.xml

<Jou>
  <wojou>10</wojou>
  <name>Proceedings of the National Academy of Sciences of the United States of America</name>
  <abbreviation>PNAS</abbreviation>
  <eissn>1091-6490</eissn>
  <issn>0027-8424</issn>
  <jid>7505876</jid>
  <namePubmed>Proc Natl Acad Sci U S A</namePubmed>
  <publisher>National Academy of Sciences</publisher>
  <type>jou</type>
  <urlArchive>http://www.pnas.org/contents-by-date.0.shtml</urlArchive>
  <urlHomepage>http://www.pnas.org/</urlHomepage>

(11)

Citations in Wikipedia

And with all variations:

  <variation>PNAS</variation>
  <variation>P NATL ACAD SCI USA</variation>
  <variation>proc natl acad sci us a</variation>
  <variation>Proc Natl Acad Sci U S A</variation>
  <variation>Proc. Natl. Acad. Sci. USA</variation>
  <variation>Proc. Natl. Acad. Sci. U.S.A</variation>
  <variation>Proceedings of the National Academy of Science</variation>
  <variation>Proceedings of the National Academy of Sciences</variation>
  <variation>Proceedings of the National Academy of Sciences U.S.A</variation>
  <variation>PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES (USA)</variation>
  <variation>proceedings of the united states national academy of sciences</variation>
  <wikipedia>Proceedings of the National Academy of Sciences</wikipedia>
</Jou>
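A variation list like this is typically turned into a lookup table that maps any spelling to one canonical name. The sketch below assumes the element names shown on the slides (`Jou`, `name`, `variation`); the embedded `SAMPLE` is a shortened hand-copied excerpt, and the `normalize` rule (lowercase, collapse whitespace) is an illustrative assumption, not the matching rule used for the original database.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the record shown on the slides.
SAMPLE = """<Jou>
  <wojou>10</wojou>
  <name>Proceedings of the National Academy of Sciences of the United States of America</name>
  <variation>PNAS</variation>
  <variation>Proc Natl Acad Sci U S A</variation>
  <variation>Proc. Natl. Acad. Sci. USA</variation>
</Jou>"""


def normalize(name):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def build_lookup(xml_text):
    """Map every normalized name variation to the canonical journal name."""
    root = ET.fromstring(xml_text)
    lookup = {}
    # iter() also yields the root itself, so a single <Jou> record works too.
    for jou in root.iter("Jou"):
        canonical = jou.findtext("name")
        if canonical is None:
            continue
        lookup[normalize(canonical)] = canonical
        for variation in jou.findall("variation"):
            if variation.text:
                lookup[normalize(variation.text)] = canonical
    return lookup


if __name__ == "__main__":
    lookup = build_lookup(SAMPLE)
    print(lookup.get(normalize("PROC.  NATL. ACAD. SCI. USA")))
```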

(12)

Citations in Wikipedia

In the end we can do:

(13)

But there is lots of other structured data in Wikipedia.

(14)

Wikidata

(17)

Wikidata

Wikidata = triples + qualifiers + references

Triples are a Semantic Web concept (Resource Description Framework), e.g., (Germany, has capital, Berlin)

With qualifiers, e.g., (Germany, has capital, Berlin, start time, 1990-10-03)

With references, e.g., (Germany, has capital, Berlin, start time, 1990-10-03, url, http://www.bundestag.de/bundestag/aufgaben/rechtsgrundlagen/grundgesetz/gg_02.html)
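One way to picture the layering is as a small data structure: a core triple with optional qualifiers and references attached. This is only an illustrative sketch; the `Statement` class and its field names are hypothetical and do not mirror the actual Wikibase data model or API, and the reference URL is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A triple extended with qualifiers and references (illustrative only)."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)   # e.g. {"start time": ...}
    references: list = field(default_factory=list)   # e.g. [{"url": ...}]

    def as_triple(self):
        """Drop qualifiers and references, leaving the plain RDF-style triple."""
        return (self.subject, self.prop, self.value)


capital = Statement(
    subject="Germany",
    prop="has capital",
    value="Berlin",
    qualifiers={"start time": "1990-10-03"},
    references=[{"url": "https://example.org/source"}],  # placeholder URL
)

if __name__ == "__main__":
    print(capital.as_triple())
    print(capital.qualifiers["start time"])
```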

(18)

Wikidata

Note the multilingual nature of Wikidata (here Danish)

(19)

So we put in bibliographic data

(20)

and citation information

Here Wikidata describes that (Nielsen, 2007) cites (Giles, 2005; Denning et al., 2005; Wilkinson and Huberman, 2007; Kleinberg, 1999).

(21)

Data entry

Wikicite data relies heavily on a few individuals and one bioinformatics group:

Magnus Manske: Tools, such as QuickStatements and Resolver

James Hare: Upload of scientific bibliographic data

Daniel Mietchen: Upload of scientific bibliographic data

San Diego et al. bioinformatics group: Genes, proteins, drugs, diseases, etc. (Mitraka et al., 2015; Burgstaller-Muehlbacher et al., 2016; Putman et al., 2017)

(22)

But so far we have

671,892 scientific articles according to WDQS as of 8 May 2017.

9,633 scientific authors as Wikidata items according to WDQS.

1,791,391 unique scientific author strings according to WDQS.

And the number of citations:

“The @Wikidata Citation Graph hit 3 million connections earlier this morning. @Wikicite”

— James Hare announcing on Twitter 30 April 2017
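Counts like the ones above come from SPARQL queries sent to the Wikidata Query Service (WDQS). The sketch below only builds the request URL, so it runs without network access. The query text is an assumption about how such a count could be phrased (items that are an instance of, P31, the scholarly article class, taken here to be Q13442814); it is not necessarily the query used for the slide, and on today's much larger Wikidata a full count may time out.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumed formulation of "number of scientific articles".
COUNT_ARTICLES_QUERY = """
SELECT (COUNT(?work) AS ?count) WHERE {
  ?work wdt:P31 wd:Q13442814 .
}
"""


def build_wdqs_url(query):
    """Return a GET URL that asks WDQS for JSON-formatted results."""
    return WDQS_ENDPOINT + "?" + urlencode(
        {"query": query.strip(), "format": "json"})


if __name__ == "__main__":
    # Fetching the URL (for example with urllib.request) is left out on
    # purpose; this only prints the request that would be sent.
    print(build_wdqs_url(COUNT_ARTICLES_QUERY))
```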

(23)

Wikidata

Wikidata was first used to capture the language links between Wikipedias.

Now it is being used to fill Wikipedia infoboxes.

Some Wikipedias are using the Wikidata bibliographic items.

But Wikidata has the potential to do more than that.

(24)

Presenting Wikidata: Reasonator

Magnus Manske’s Reasonator, https://tools.wmflabs.org/reasonator/

Extracts information from Wikidata and makes templated (“natural language”) text, maps, timelines, fetches relevant images, formats other information nicely and adds internal and external links.

Runs from Wikimedia Tool Labs

(25)

Presenting Wikidata: SQID

Markus Krötzsch, Michael Günther et al. SQID, https://tools.wmflabs.org/sqid/

Wikidata class browser.

Displays typical properties

Runs from Wikimedia Tool Labs

(26)

Scholia

Web site with scholarly information extracted from Wikidata, running from https://tools.wmflabs.org/scholia/ (Nielsen et al., 2017).

Developed on GitHub under the GPL, https://github.com/fnielsen/scholia, with work/input from Daniel Mietchen, Egon Willighagen, Jakob Voß, Magnus Manske, Andy Mabbett

Almost entirely built using the Wikidata Query Service, an extended SPARQL endpoint available at https://query.wikidata.org/ and maintained by the Wikimedia Foundation. It can not only return tables with SPARQL results but also format the results as charts: maps, bar charts, graphs, etc.

(27)

Scholia: Author aspect publications per year

Inspired by Shubhanshu Mishra’s and Vetle I. Torvik’s LEGOLAS visualization.

Number of publications per year.

Color-coding based on author role (first author, last author, middle author, solo author)
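The four author roles can be derived from an author's position in the byline and the total number of authors. The helper below is a hypothetical sketch of that classification, not Scholia's code; the function name and the 1-based position convention are assumptions.

```python
def author_role(position, n_authors):
    """Classify a 1-based byline position into one of four roles."""
    if not 1 <= position <= n_authors:
        raise ValueError("position must be between 1 and n_authors")
    if n_authors == 1:
        return "solo author"
    if position == 1:
        return "first author"
    if position == n_authors:
        return "last author"
    return "middle author"


if __name__ == "__main__":
    for pos in (1, 2, 3):
        print(pos, author_role(pos, 3))
```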

(28)

Scholia: Work aspect citation graph

Citation panel on work aspect for partial citation graph.

For A principal component analysis of 39 scientific impact measures.

(29)

Scholia: Publisher aspect

Overview of the number of papers published and their citations across the journals of a publisher.

Here for BioMed Central (which may be an imprint)

(30)

Scholia: Organization aspect

Incomplete statistics on page production per year for DTU Cognitive Systems.

(31)

Scholia: Organization aspect

(32)

Scholia: Organization aspect

Co-author graph for DTU Cognitive Systems.

(34)

“Top 10”

“Top 10: KU-forskere med flest artikler i Nature og Science” (“Top 10: University of Copenhagen researchers with the most articles in Nature and Science”), https://uniavisen.dk/top-10-ku-forskere-med-flest-artikler-nature-og-science/

Top journals according to Wikidata SPARQL:

VALUES ?top_journals { wd:Q192864 wd:Q180445 }

University of Copenhagen researchers with Wikidata SPARQL:

SELECT DISTINCT ?researcher WHERE {
  { ?researcher wdt:P108 wd:Q186285 . }
  UNION { ?researcher wdt:P1416 [ wdt:P361* wd:Q186285 ] . } }

(35)

“Top 10”: Statistics from WDQS

KU    Wikidata   Researcher
25    21         Eske Willerslev
83    18         Jun Wang
15    14         Ludovic Orlando
15    7          Søren Brunak
17    2          Niels Grarup
-     2          Eline D. Lorenzen
-     2          Thomas Werge
-     2          Albin Sandelin
-     2          Lars Juhl Jensen
-     2          Anders Krogh

(36)

“Top 10”: Full query

SELECT ?number_of_publications ?researcherLabel WITH {
  SELECT ?researcher (COUNT(?work) AS ?number_of_publications) WHERE {
    VALUES ?top_journals { wd:Q192864 wd:Q180445 }
    {
      SELECT DISTINCT ?researcher WHERE {
        { ?researcher wdt:P108 wd:Q186285 . }
        UNION { ?researcher wdt:P1416 [ wdt:P361* wd:Q186285 ] . } }
    }
    ?work wdt:P50 ?researcher .
    ?work wdt:P1433 ?top_journals . }
  GROUP BY ?researcher } AS %result
WHERE {
  INCLUDE %result
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" . }
}
ORDER BY DESC(?number_of_publications) LIMIT 10
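A SPARQL endpoint can return the result of such a query in the standard SPARQL JSON results format, where each row is a "binding" keyed by variable name. The sketch below turns that structure into a ranked list. The function name `top_researchers` is hypothetical, and `SAMPLE_RESPONSE` is a hand-written stand-in using two rows from the statistics slide, not a captured server response.

```python
def top_researchers(sparql_json):
    """Return (researcher label, publication count) pairs, highest first."""
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        name = binding["researcherLabel"]["value"]
        # Values arrive as strings in the JSON results format.
        count = int(binding["number_of_publications"]["value"])
        rows.append((name, count))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hand-written sample in SPARQL JSON results layout (illustrative only).
SAMPLE_RESPONSE = {
    "head": {"vars": ["number_of_publications", "researcherLabel"]},
    "results": {"bindings": [
        {"number_of_publications": {"type": "literal", "value": "18"},
         "researcherLabel": {"type": "literal", "value": "Jun Wang"}},
        {"number_of_publications": {"type": "literal", "value": "21"},
         "researcherLabel": {"type": "literal", "value": "Eske Willerslev"}},
    ]},
}

if __name__ == "__main__":
    for name, count in top_researchers(SAMPLE_RESPONSE):
        print(count, name)
```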

(38)

“Top 10”

“Top 10: KU-forskere med flest artikler i Nature og Science” (“Top 10: University of Copenhagen researchers with the most articles in Nature and Science”)

Wikidata SPARQL can answer this on the fly, but the answer is still incomplete.

Data is lacking because of the difficulty of resolving common names such as Wang, Zhang, Hansen, Pedersen, etc.

(42)

Give me an introductory paper

What is the best introductory/overview paper on word embeddings?

We are not there yet.

But we can get “Most cited works from works on the topic” from the topic aspect of word embedding pages.

This gives: (Mikolov et al., 2013b; Mikolov et al., 2013a) in a table.

(43)

Incomplete data

Finn Årup Nielsen h-index (8 May 2017):

h     Service
28    Google Scholar
20    Scopus
18    Web of Science
8     Wikidata with Scholia query
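The h-index is the largest h such that h works each have at least h citations, so a service that knows fewer works and fewer citations will report a lower value. A minimal sketch of the computation follows; the citation counts in the usage example are made up for illustration and are not the author's data.

```python
def h_index(citation_counts):
    """Largest h such that h works have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


if __name__ == "__main__":
    # Made-up counts: a service with partial coverage sees fewer citations.
    full_coverage = [10, 8, 5, 4, 3]
    partial_coverage = [3, 2, 1, 0, 0]
    print(h_index(full_coverage), h_index(partial_coverage))
```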

(44)

Wikidata-based BibTeX generation

A rough-around-the-edges implementation in Scholia can generate BibTeX .bib files from .aux files

My .tex file:

\bibliographystyle{Nielsen2012Slides}
\bibliography{Nielsen2017Wikicite_slides}

Commands:

latex Nielsen2017Wikicite_slides.tex
python -m scholia.tex write-bib-from-aux Nielsen2017Wikicite_slides.aux
bibtex Nielsen2017Wikicite_slides
latex Nielsen2017Wikicite_slides.tex
latex Nielsen2017Wikicite_slides.tex
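The first step of any aux-to-bib workflow is reading the citation keys that LaTeX writes to the .aux file as `\citation{...}` lines. The sketch below shows only that step and is not Scholia's implementation; `extract_citation_keys` is a hypothetical helper, and how the keys are then resolved against Wikidata is left out.

```python
import re

# LaTeX records each \cite as a \citation{key1,key2,...} line in the .aux file.
CITATION_RE = re.compile(r"\\citation\{([^}]*)\}")


def extract_citation_keys(aux_text):
    """Return the unique citation keys in order of first appearance."""
    keys = []
    for group in CITATION_RE.findall(aux_text):
        for key in group.split(","):
            key = key.strip()
            if key and key not in keys:
                keys.append(key)
    return keys


if __name__ == "__main__":
    sample_aux = (
        "\\relax\n"
        "\\citation{Nielsen2007,Giles2005}\n"
        "\\citation{Giles2005}\n"
        "\\bibdata{Nielsen2017Wikicite_slides}\n"
    )
    print(extract_citation_keys(sample_aux))
```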

(45)

Wikicite issues :(

Wikidata is far from complete.

Citation data is lacking, but some has been released through I4OC.

Affiliations are not recorded per paper, so scientometrics with precise affiliation resolving is not possible at the moment.

Large-scale analysis is difficult with WDQS because of timeouts.

(46)

Wikicite issues :)

Wikidata acts as a hub linking different resources: Google Scholar, Twitter, Scopus, VIAF, ResearchGate, ...

Good author disambiguation is possible, even for authors who do not have an account on the site (I have the suspicion that VIAF uses Wikidata for correcting their database).

Data description more detailed with many different properties: main theme, genre, multiple affiliation with time points, sex of author, license, sponsor, etc.

Linking to much more than science: Wikidata is becoming the “Internet duct tape that can solve anything” (light-hearted comment by Andrew Lih, somewhere on Facebook)

(47)

What’s next for Scholia and Wikicite?

Continued upload of data available from APIs to Wikidata.

Building scrapers, e.g., in Scholia.

Better integration between panels and aspects in Scholia (JavaScript and D3 work)

“Editable Scholia”: Edit Wikidata items from Scholia. (Magnus Manske implements editing with his Listeria tool).

“Social Scholia”: User login, followers, followees, messages between

(49)

Wikicite and libraries

Ask not what Wikidata can do for you or what you can do for Wikidata, but what Wikidata can do for you and you can do for Wikidata.

What about uploading all Danish research available in the Danish National Research Database?

Author disambiguation: Three entries in the Danish National Research Database for “Finn Årup Nielsen”. There is only one.

(50)

Thanks

(51)

References

Burgstaller-Muehlbacher, S., Waagmeester, A., Mitraka, E., Turner, J., Putman, T. E., Leong, J., Naik, C., Pavlidis, P., Schriml, L., Good, B. M., and Su, A. I. (2016). Wikidata as a semantic framework for the Gene Wiki initiative. Database, 2016:baw015. DOI: 10.1093/DATABASE/BAW015.

Denning, P., Horning, J., Parnas, D., and Weinstein, L. (2005). Wikipedia risks. 48:152. DOI: 10.1145/1101779.1101804.

Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438:900–901. DOI: 10.1038/438900A.

Kleinberg, J. M. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:604–632. DOI: 10.1145/324133.324140.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient Estimation of Word Representations in Vector Space.

Mikolov, T., Dean, J., and Corrado, G. (2013b). Distributed Representations of Words and Phrases and their Compositionality. Proceedings of the 26th International Conference on Neural Information Processing Systems, pages 3111–3119.

Mitraka, E., Waagmeester, A., Burgstaller-Muehlbacher, S., Schriml, L. M., Su, A. I., and Good, B. M. (2015). Wikidata: A platform for data integration and dissemination for the life sciences and beyond. DOI: 10.1101/031971.

Nielsen, F. Å. (2007). Scientific citations in Wikipedia. First Monday, 12. DOI: 10.5210/FM.V12I8.1997.

Nielsen, F. Å., Mietchen, D., and Willighagen, E. (2017). Scholia and scientometrics with Wikidata.

Putman, T. E., Lelong, S., Burgstaller-Muehlbacher, S., Waagmeester, A., Diesh, C., Dunn, N., Munoz-Torres, M., Stupp, G., Su, A. I., and Good, B. M. (2017). WikiGenomes: an open Web application for
