Metadata

Anne-Marie Mineur
UiL-OTS / LOT
Utrecht University

Introduction

Summary of the previous days

  • Monday: XML ins and outs
  • Tuesday: XML-related proposals
  • Wednesday: XML software
  • Thursday: Language encoding recommendations
  • Today: Metadata

In the beginning ...

Original from http://www.wildireland.ie/screensaver/desktops/Images/cobweb.jpg

When the web was new and shiny, everything seemed to be brilliant, and it was just a plain gold mine.

But then ...

Original from http://www.biotech.ucdavis.edu/images/paperwork.jpg

But then so many people started piling their stuff on the web that it became very hard to find what you were looking for.

The result ...

Original from http://hawaii.psychology.msstate.edu/photos/funphotos/Haystack.jpeg

The result was a haystack in which every needle gets lost.

Metadata organisations

  • Primary aim: increase accessibility of existing archives
  • Secondary aim: promote the use of open standards
  • Original beneficiaries: E-print archives

Metadata proposals & projects

  • DCMI: Dublin Core Metadata Initiative (1995)
  • OAi: Open Archives Initiative (1999)
  • OLAC: Open Language Archives Community (2001)
  • LTRC: Language Typology Resource Center (2002)
  • TDP: Typological Database Project (2003?)

Dublin Core Metadata Initiative (DCMI)

Dublin Core Metadata Initiative (1)

The Dublin Core Metadata Initiative (dublincore.org) was founded during a joint workshop of the National Center for Supercomputing Applications (NCSA) and the Online Computer Library Center (OCLC) that was held in Dublin, Ohio, in March 1995.

The premise was that a core set of semantics for Web-based resources would be extremely useful for categorizing the Web for easier search and retrieval.

Dublin Core Metadata Initiative (2)

Quote from http://dublincore.org/about/:

"The Dublin Core Metadata Initiative (DCMI) is an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems."

Dublin Core Specifications

There are four kinds of documents:
  • DCMI Recommendations: Specifications are stable and are supported for adoption.
  • DCMI Proposed Recommendations: Growing support for adoption; specifications are close to stable.
  • DCMI Working Drafts: For review.
  • Notes: For discussion only.

Dublin Core Element Set (DCES)

From http://dublincore.org/documents/dces/:

  • Title: A name given to the resource.
  • Creator: An entity primarily responsible for making the content of the resource.
  • Subject: The topic of the content of the resource.
  • Description: An account of the content of the resource.
  • Publisher: An entity responsible for making the resource available.
  • Contributor: An entity responsible for making contributions to the content of the resource.
  • Date: A date associated with an event in the life cycle of the resource.
  • Type: The nature or genre of the content of the resource.
  • Format: The physical or digital manifestation of the resource.
  • Identifier: An unambiguous reference to the resource within a given context.
  • Source: A Reference to a resource from which the present resource is derived.
  • Language: A language of the intellectual content of the resource.
  • Relation: A reference to a related resource.
  • Coverage: The extent or scope of the content of the resource.
  • Rights: Information about rights held in and over the resource.
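As an illustration of how these elements might be used, the following Python sketch builds a small XML record with one child per Dublin Core element. The namespace URI is the one DCMI publishes for the element set; the wrapper element name `record`, the helper name `dc_record` and the field values are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen element names from the list above, in lowercase as used in XML.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

def dc_record(fields):
    """Build a <record> element with one dc:* child per (name, value) pair."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # wrapper name is an assumption, not part of DCES
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record

# Hypothetical values; every element is optional and may be repeated.
example = dc_record([
    ("title", "Metadata"),
    ("creator", "Anne-Marie Mineur"),
    ("publisher", "Utrecht University"),
    ("subject", "metadata"),
    ("subject", "XML"),
    ("language", "en"),
])
print(ET.tostring(example, encoding="unicode"))
```

Passing the fields as a list of pairs rather than a dictionary keeps repeated elements (here two subjects) possible.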

Element Attributes

  • Name: Subject and Keywords
  • Identifier: Subject
  • Version: 1.1
  • Registration Authority: Dublin Core Metadata Initiative
  • Language: en
  • Definition: The topic of the content of the resource.
  • Obligation: Optional
  • Datatype: Character String
  • Maximum Occurrence: Unlimited
  • Comment: Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.

Open Archives Initiative (OAi)

Open Archives Initiative (1)

The Open Archives Initiative (http://www.openarchives.org/) was established in 1999 under the joint sponsorship of:
  • CLIR: Council on Library and Information Resources
  • DLF: Digital Library Federation
  • SPARC: Scholarly Publishing & Academic Resources Coalition
  • ARL: Association of Research Libraries
  • LANL: Los Alamos National Laboratory

Open Archives Initiative (2)

Quote from http://www.openarchives.org/organization/:

Mission Statement

"The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication."

OAi Members

  • Cyclades Project: An Open Collaborative Virtual Archive Environment funded by the European Union.
  • Kepler: Self-contained, self-installing software that allows the user to create and maintain a small, OAi-compliant archive (an "archivelet").
  • Open Archive Forum (OAF): An EC-funded accompanying measures project to support projects and national initiatives that are interested in using an open archive approach to interoperability.
  • Virginia Tech Digital Library Research Lab Projects
  • Open Language Archives Community (OLAC): A cross-archive searching service for the language community.

Data providers

  • A Celebration of Women Writers
  • Academia Sinica Balanced Corpus of Modern Chinese
  • Academia Sinica Formosan Language Archive
  • Academia Sinica Tagged Corpus of Early Mandarin Chinese
  • Ackerman Archives: Experimental File-based OAI Archive
  • Alaska Native Language Center
  • Alex Catalogue of Electronic Texts
  • Archive Lyon 2
  • Articles en ligne Jean Nicod
  • arXiv
  • ATILF Resources
  • BioMed Central
  • California Digital Library Repository 1
  • California International and Area Studies Digital Repository
  • Caltech Computer Science Technical Reports
  • Caltech Earthquake Engineering Research Laboratory Technical Reports
  • Caltech Electronic Theses and Dissertations
  • Chemistry Preprint Server
  • CIMI Metadata Harvesting Working Group Demonstration Repository
  • Cognitive Science Data Archive
  • CogPrints
  • Comparative Bantu Online Dictionary (CBOLD)
  • Computer Science Teaching Center
  • conoZe: intelligere ut credas, credere ut intelligas
  • CyberTheses
  • Dermatology Digital Repository
  • Digital Library of the Commons
  • DUETT - Dissertations and other Documents of the Gerhard-Mercator-University Duisburg
  • E-Numerate RDL Header Collection Prototype
  • Elektronisches Dokumenten-, Archivierungs- und Retrievalsystem der Universität Dortmund
  • Eprint Archive
  • EPSILON EPrints2 Dissertation Test Archive
  • ePub-WU OAI Archive (Vienna Univ. of Econ. and B.A.)
  • ETD Individuals
  • Ethnologue: Languages of the World
  • Formations
  • Fourth International Symposium on Cavitation
  • Groningen University Library
  • Hochschulschriftenserver (HSSS) der SLUB Dresden
  • HofPrints
  • Hong Kong University Theses Online
  • Humboldt University of Berlin, GERMANY, Document Server
  • Ibiblio Collection Index

Service Providers

  • Arc (Old Dominion University)
  • citebaseSearch (Southampton University)
  • DP9 (Old Dominion University)
  • NCSTRL (Old Dominion University, University of Virginia)
  • OAIster (University of Michigan Libraries)
  • Public Knowledge Harvester (U. of British Columbia)
  • TORII (Trieste, Italy)

Open Language Archives Community (OLAC)

Open Language Archives Community

OLAC (http://www.language-archives.org/) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:

  • developing consensus on best current practice for the digital archiving of language resources, and
  • developing a network of interoperating repositories and services for housing and accessing such resources.

"Any user on the Internet should be able to go to a single gateway to find all the language resources available at all participating institutions, whether the resources be data, tools, or advice. The community will ensure on-going interoperation and quality by following standards for the metadata that describe resources and services and for processes that review them."

(From http://www.language-archives.org/OLAC/process.html)

Migrating from OAi to OLAC

The implementer of an OLAC data provider must implement the OAi protocol plus three additional features:

  • support the OAi format for unique identifiers of records
  • supply an OLAC-specific archive description
  • support the OLAC-specific metadata standard
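The first requirement can be made concrete with a small Python sketch. It assumes the identifier layout from the OAI identifier guidelines, `oai:<repository domain>:<local id>`, checked here in simplified form, and it builds a `GetRecord` request as defined by the OAI protocol. The base URL, the record identifier and the default metadata prefix `olac` are assumptions for the example, not values taken from a real archive.

```python
import re
from urllib.parse import urlencode

# Simplified check of the layout oai:<repository domain>:<local id>.
# The domain part needs at least one dot; the local id is not validated further.
OAI_ID = re.compile(
    r"^oai:([A-Za-z][A-Za-z0-9\-]*(?:\.[A-Za-z][A-Za-z0-9\-]*)+):(.+)$"
)

def parse_oai_identifier(identifier):
    """Split an OAI identifier into (repository, local id), or raise ValueError."""
    match = OAI_ID.match(identifier)
    if match is None:
        raise ValueError(f"not an OAI identifier: {identifier!r}")
    return match.group(1), match.group(2)

def get_record_url(base_url, identifier, metadata_prefix="olac"):
    """Build a GetRecord request URL for a single record."""
    parse_oai_identifier(identifier)  # reject malformed identifiers early
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,  # assumed prefix for OLAC records
    })
    return f"{base_url}?{query}"

# Hypothetical archive and record, for illustration only.
print(parse_oai_identifier("oai:example.org:rec-42"))
print(get_record_url("http://example.org/oai", "oai:example.org:rec-42"))
```

A service provider would send such a request to each data provider and read the metadata from the XML response.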

OLAC Metadata Set

(From http://www.language-archives.org/OLAC/olacms.html)

  • Title
  • Creator
  • Subject
    • Subject.language: A language which the content of the resource describes or discusses.
  • Description
  • Publisher
  • Contributor
  • Date
  • Type
    • Type.functionality: Software Functionality
    • Type.linguistic: The nature or genre of the content of the resource from a linguistic standpoint.
  • Format
    • Format.cpu: The CPU required to use a software resource.
    • Format.encoding: An encoded character set used by a digital resource.
    • Format.markup: A markup scheme used by a digital resource.
    • Format.os: An operating system required to use a software resource.
    • Format.sourcecode: A programming language of software distributed in source form.
  • Identifier
  • Source
  • Language
  • Relation
  • Coverage
  • Rights
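The dotted names in this set can be read as refinements of the Dublin Core element before the dot. Under that reading, a service that only understands plain Dublin Core can still use an OLAC record by falling back to the base element. The Python sketch below shows this fallback; the function names and the sample record are assumptions for the example.

```python
# The fifteen Dublin Core elements.
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}

# The OLAC-specific names from the list above.
OLAC_REFINEMENTS = {
    "Subject.language", "Type.functionality", "Type.linguistic",
    "Format.cpu", "Format.encoding", "Format.markup",
    "Format.os", "Format.sourcecode",
}

def base_element(name):
    """Return the Dublin Core element that an OLAC element name falls back to."""
    if name in DC_ELEMENTS:
        return name
    if name in OLAC_REFINEMENTS:
        return name.split(".", 1)[0]
    raise ValueError(f"unknown element: {name}")

def to_dublin_core(record):
    """Map (element, value) pairs to plain Dublin Core pairs."""
    return [(base_element(name), value) for name, value in record]

# Hypothetical record, for illustration only.
record = [
    ("Title", "Example word list"),
    ("Subject.language", "Dutch"),
    ("Format.encoding", "UTF-8"),
    ("Format.markup", "XML"),
]
print(to_dublin_core(record))
```

The fallback loses the extra precision of the refinement but keeps the value searchable under the general element.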

Language Typology Resource Center (LTRC)

Language Typology Resource Center

http://www-uilots.let.uu.nl/td/LTRC/

The Language Typology Resource Center is a project funded by the European Community. It is a "thematic network" carried out in the framework of the specific research and technological development programme "Improving the Human Research Potential and the Socio-Economic Knowledge Base".

Aim of the LTRC Project

The aim of the initiative is to create a web-accessible electronic archive for typological description, including powerful research tools such as typological databases, language-typological expert systems, extensive scientific grammars, and corpora.

Objectives of the LTRC project

  1. to combine expertise available with the different partners in the research network in developing typological databases;
  2. to collect relevant information on available databases and datasets in Europe;
  3. to stimulate the conversion of a number of existing databases (digitised or otherwise) into standardised formats;
  4. to disseminate expertise from the participants in setting up databases and solving technical and fundamental issues;
  5. to encourage establishing standards that the linguistic typological community will adhere to in creating databases;
  6. to acquire and develop the appropriate metadata set information, and appropriate software for accessing databases, either directly or through a meta database system as part of a Language Typology Resource Centre;
  7. to make the results available to the linguistic community via a Language Typology Resource Centre Website.

Network Members

  1. Netherlands Graduate School in Linguistics (LOT) (wwwlot.let.uu.nl)
  2. Linguistics department, Max Planck Institute for Evolutionary Anthropology (www.eva.mpg.de/lingua/)
  3. Institut für Englische Philologie, Freie Universität Berlin (http://www.fu-berlin.de/)
  4. Department of Linguistics, Stockholm University (www.ling.su.se/)
  5. Department of Linguistics and International Studies, University of Surrey (www.surrey.ac.uk/LIS/)
  6. Department of Computer Science, University of Lisbon (www.di.fc.ul.pt/)
  7. Sprachbaupläne, University of Konstanz (ling.uni-konstanz.de/)
  8. Department of Linguistics and Modern English Language, University of Lancaster (http://www.ling.lancs.ac.uk/)
  9. Zentrum für Allgemeine Sprachwissenschaft Typologie und Universalienforschung (Berlin) (http://www.zas.gwz-berlin.de/)

Databases involved in the LTRC

  1. Intensifiers/Emphatic reflexives (and reflexives) (Freie Universität Berlin) [XML metadata]
  2. Surrey Syncretism Database (University of Surrey) [XML metadata]
  3. Utrecht Linguistic Database, Aspect (University of Utrecht) [XML metadata]
  4. Nexing Corpus (University of Lisbon/Coimbra) [XML metadata]
  5. StressTyp (University of Utrecht) [XML metadata]
  6. Person Agreement Database (University of Amsterdam / Lancaster) [XML metadata]
  7. Spinoza Areal Database (Leiden University) [XML metadata]
  8. The Intonational Lexicon: Contours and Alignment in Read Texts (Konstanz)
  9. Database Sprachbaupläne/Universals Archive (Konstanz)
  10. Dualis constructions database (Konstanz)
  11. Number use in language: a quantitative and typological investigation (Surrey)
  12. Anaphora database (Language in Use project) (Utrecht)
  13. World Atlas of Language Structures (Leipzig)
  14. Word order database (Amsterdam)
  15. Evidentiality database (Amsterdam)
  16. Comparison database (Nijmegen)
  17. Intransitive predication database (Nijmegen)

Typological Database Project (TDP)

Typological Database Project

http://www-uilots.let.uu.nl/td/

The Typological Database Project (TDP) is funded by the Netherlands Organisation for Scientific Research (NWO) and, like the LTRC, is chaired by the Netherlands Graduate School of Linguistics (LOT) at Utrecht University.

The project is currently 'in between funding'. The pilot project ended last year, and although the renewal application got A-status, there was no money for new funding this year. Another reapplication is underway.

Partners

  1. Utrecht institute of Linguistics OTS, Utrecht University (www-uilots.let.uu.nl)
  2. Language and Cognition group, Max Planck Institute, Nijmegen (http://www.mpi.nl/)
  3. Department of Linguistics, University of Amsterdam (http://www.hum.uva.nl/)
  4. Department of Linguistics, Nijmegen University (atd.let.kun.nl/)
  5. Department of Linguistics, Leiden University (www.leidenuniv.nl/let/)
  6. Grammatical Models group, Tilburg University (http://www.kub.nl/)

Aim of the TDS project

Create a Typological Database System that allows the user

  • to access a large number of databases through one single interface
  • to combine the results in an intelligent way.

Example

StressTyp

  • Contents: Stress systems
  • Coverage: about 500 languages from all over the world
  • Software: 4th Dimension, on a server at Leiden University
  • Framework: Chomskyan tradition

Eurotype project: word order

  • Contents: Word order
  • Coverage: 150 European languages
  • Software: custom database application in Pascal; a conversion to Microsoft Access is in progress. Not online.
  • Framework: Functional Grammar (Simon Dik)
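The aim stated above, one interface over several databases with combined results, can be sketched in Python as follows. This is only a toy model and not the TDS design: the two sources, their column names, the shared terms and all values are invented placeholders. Each source keeps its own column names, and a per-source mapping translates them to shared terms before the rows are merged per language.

```python
# Two hypothetical sources with different local column names (placeholder data).
SOURCES = {
    "source-a": [
        {"lang": "lang-a", "stress": "pattern-1"},
        {"lang": "lang-b", "stress": "pattern-2"},
    ],
    "source-b": [
        {"language_name": "lang-b", "order": "order-1"},
        {"language_name": "lang-c", "order": "order-2"},
    ],
}

# Per-source mapping from local column names to shared terms (assumed vocabulary).
MAPPINGS = {
    "source-a": {"lang": "language", "stress": "stress_system"},
    "source-b": {"language_name": "language", "order": "word_order"},
}

def normalise(source_name):
    """Yield the rows of one source with columns renamed to the shared terms."""
    mapping = MAPPINGS[source_name]
    for row in SOURCES[source_name]:
        yield {mapping[key]: value for key, value in row.items()}

def combined_view():
    """Merge the rows of all sources into one record per language."""
    merged = {}
    for source_name in SOURCES:
        for row in normalise(source_name):
            merged.setdefault(row["language"], {}).update(row)
    return merged

for language, record in combined_view().items():
    print(language, record)
```

A language covered by both sources ends up with both properties in one record, which is the simplest form of combining results; reconciling different theoretical frameworks would need far more than a column mapping.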

TDS Architecture

Copyright: Typological Database Project / Utrecht University

Goals

  • Support and promote the development of typological databases, both phenomenon-specific ones and general ones.
  • Develop a linguistic metalanguage, that is, a list of relevant terminology, as well as a list of relevant phenomena.
  • Develop content metadata to describe the content and the format of the various databases participating in the project, that is, the relevant metadata elements and the appropriate vocabulary, with the aim of a common standard.
  • Develop an appropriate user interface.

Summary

Summary

  • Metadata initiatives benefit from standards
    • Character encoding
    • Software encoding
    • Linguistic terminology
  • XML seems suited for at least the metadata level

Needle, haystack ...?

Perhaps in the not too distant future, we will find that needle in the haystack?

Original from http://www.apax.com/images/v1/Haystack-FINALb.jpg