Metadata

Anne-Marie Mineur
UiL-OTS / LOT
Utrecht University

Introduction

Summary of the previous days

Monday: XML ins and outs
Tuesday: XML-related proposals
Wednesday: XML software
Thursday: Language encoding recommendations
Today: Metadata

In the beginning ...

Original from http://www.wildireland.ie/screensaver/desktops/Images/cobweb.jpg

When the web was new and shiny, everything seemed to be brilliant, and it was just a plain gold mine.

But then ...

Original from http://www.biotech.ucdavis.edu/images/paperwork.jpg

But then so many people started piling their stuff on the web that it became very hard to find what you were looking for.

The result ...

Original from http://hawaii.psychology.msstate.edu/photos/funphotos/Haystack.jpeg

The result was a haystack in which every needle gets lost.

Metadata organisations

Primary aim: increase accessibility of existing archives
Secondary aim: promote the use of open standards
Original beneficiaries: E-print archives

Metadata proposals & projects

DCMI: Dublin Core Metadata Initiative (1995)
OAi: Open Archives Initiative (1999)
OLAC: Open Language Archives Community (2001)
LTRC: Language Typology Resource Center (2002)
TDP: Typological Database Project (2003?)

Dublin Core Metadata Initiative (DCMI)

Dublin Core Metadata Initiative (1)

The Dublin Core Metadata Initiative (dublincore.org) was founded during a joint workshop of the National Center for Supercomputing Applications (NCSA) and the Online Computer Library Center (OCLC) that was held in Dublin, Ohio, March 1995.

The premise was that a core set of semantics for Web-based resources would be extremely useful for categorizing the Web for easier search and retrieval.

Dublin Core Metadata Initiative (2)

Quote from http://dublincore.org/about/:

"The Dublin Core Metadata Initiative (DCMI) is an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems."

Dublin Core Specifications

There are four kinds of documents:

DCMI Recommendations: Specifications are stable and are supported for adoption.
DCMI Proposed Recommendations: Growing support for adoption; specifications are close to stable.
DCMI Working Drafts: For review.
Notes: For discussion only.

Dublin Core Element Set (DCES)

From http://dublincore.org/documents/dces/:

Title: A name given to the resource.
Creator: An entity primarily responsible for making the content of the resource.
Subject: The topic of the content of the resource.
Description: An account of the content of the resource.
Publisher: An entity responsible for making the resource available.
Contributor: An entity responsible for making contributions to the content of the resource.
Date: A date associated with an event in the life cycle of the resource.
Type: The nature or genre of the content of the resource.
Format: The physical or digital manifestation of the resource.
Identifier: An unambiguous reference to the resource within a given context.
Source: A Reference to a resource from which the present resource is derived.
Language: A language of the intellectual content of the resource.
Relation: A reference to a related resource.
Coverage: The extent or scope of the content of the resource.
Rights: Information about rights held in and over the resource.
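Example (a minimal sketch, not taken from the DCMI documents): the fifteen elements are usually serialised as XML in the Dublin Core namespace http://purl.org/dc/elements/1.1/. The snippet below builds such a record with Python's standard library. The wrapper element name `record`, the helper `make_dc_record`, and all field values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def make_dc_record(fields):
    """Build a <record> element with one dc:* child per (element, value) pair.

    The wrapper name 'record' is an illustrative choice, not part of DCES.
    """
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only; elements are optional and repeatable
fields = [
    ("title", "Metadata"),
    ("creator", "Anne-Marie Mineur"),
    ("language", "en"),
    ("subject", "metadata"),
    ("subject", "XML"),
]

record = make_dc_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Note that `subject` occurs twice: every element may be repeated or left out.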

Element Attributes

Name: Subject and Keywords
Identifier: Subject
Version: 1.1
Registration Authority: Dublin Core Metadata Initiative
Language: en
Definition: The topic of the content of the resource.
Obligation: Optional
Datatype: Character String
Maximum Occurrence: Unlimited
Comment: Typically, a Subject will be expressed as keywords, key phrases or classification codes that describe a topic of the resource. Recommended best practice is to select a value from a controlled vocabulary or formal classification scheme.
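Example (a sketch under stated assumptions): the attributes above can be read as a small schema for one element. The class `ElementSpec` and its method `accepts` are hypothetical names, and the obligation value "Mandatory" is an assumed counterpart to "Optional"; only the values for Subject come from the listing above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementSpec:
    """A few of the attributes that describe one metadata element."""
    name: str
    identifier: str
    obligation: str                 # "Optional", or (assumed) "Mandatory"
    max_occurrence: Optional[int]   # None stands for "Unlimited"

    def accepts(self, values):
        """Check a list of values against obligation, occurrence and datatype."""
        if self.obligation == "Mandatory" and not values:
            return False
        if self.max_occurrence is not None and len(values) > self.max_occurrence:
            return False
        # Datatype: Character String
        return all(isinstance(v, str) for v in values)


# Values taken from the attribute listing for Subject
SUBJECT = ElementSpec(
    name="Subject and Keywords",
    identifier="Subject",
    obligation="Optional",
    max_occurrence=None,
)

print(SUBJECT.accepts([]))                   # optional, so no value is fine
print(SUBJECT.accepts(["metadata", "XML"]))  # unlimited occurrences
```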

Open Archives Initiative (OAi)

Open Archives Initiative (1)

The Open Archives Initiative (http://www.openarchives.org/) was established in 1999 under the joint sponsorship of the:

CLIR: Council on Library and Information Resources
DLF: Digital Library Federation
SPARC: Scholarly Publishing & Academic Resources Coalition
ARL: Association of Research Libraries
LANL: Los Alamos National Laboratory

Open Archives Initiative (2)

Quote from http://www.openarchives.org/organization/:

Mission Statement

"The Open Archives Initiative develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content. The Open Archives Initiative has its roots in an effort to enhance access to e-print archives as a means of increasing the availability of scholarly communication."

OAi Members

Cyclades Project: An Open Collaborative Virtual Archive Environment funded by the European Union.
Kepler's Home Page: self-contained, self-installing software that allows the user to create and maintain a small, OAi-compliant archive - archivelet.
Open Archive Forum (OAF): An EC-funded accompanying measures project to support projects and national initiatives that are interested in using an open archive approach to interoperability.
Virginia Tech Digital Library Research Lab Projects
Open Language Archives Community (OLAC): a cross-archive searching service for the language community.

Data providers

A Celebration of Women Writers
Academia Sinica Balanced Corpus of Modern Chinese
Academia Sinica Formosan Language Archive
Academia Sinica Tagged Corpus of Early Mandarin Chinese
Ackerman Archives: Experimental File-based OAI Archive
Alaska Native Language Center
Alex Catalogue of Electronic Texts
Archive Lyon 2
Articles en ligne Jean Nicod
arXiv
ATILF Resources
BioMed Central
California Digital Library Repository 1
California International and Area Studies Digital Repository
Caltech Computer Science Technical Reports
Caltech Earthquake Engineering Research Laboratory Technical Reports
Caltech Electronic Theses and Dissertations
Chemistry Preprint Server
CIMI Metadata Harvesting Working Group Demonstration Repository
Cognitive Science Data Archive
CogPrints
Comparative Bantu Online Dictionary (CBOLD)
Computer Science Teaching Center
conoZe: intelligere ut credas, credere ut intelligas
CyberTheses
Dermatology Digital Repository
Digital Library of the Commons
DUETT - Dissertations and other Documents of the Gerhard-Mercator-University Duisburg
E-Numerate RDL Header Collection Prototype
Elektronisches Dokumenten-, Archivierungs- und Retrievalsystem der Universität Dortmund
Eprint Archive
EPSILON EPrints2 Dissertation Test Archive
ePub-WU OAI Archive (Vienna Univ. of Econ. and B.A.)
ETD Individuals
Ethnologue: Languages of the World
Formations
Fourth International Symposium on Cavitation
Groningen University Library
Hochschulschriftenserver (HSSS) der SLUB Dresden
HofPrints
Hong Kong University Theses Online
Humboldt University of Berlin, GERMANY, Document Server
Ibiblio Collection Index

Service Providers

Arc (Old Dominion University)
citebaseSearch (Southampton University)
DP9 (Old Dominion University)
NCSTRL (Old Dominion University, University of Virginia)
OAIster (University of Michigan Libraries)
Public Knowledge Harvester (U. of British Columbia)
TORII (Trieste, Italy)

Open Language Archives Community (OLAC)

Open Language Archives Community

OLAC (http://www.language-archives.org/) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by:

developing consensus on best current practice for the digital archiving of language resources, and
developing a network of interoperating repositories and services for housing and accessing such resources.

"Any user on the Internet should be able to go to a single gateway to find all the language resources available at all participating institutions, whether the resources be data, tools, or advice. The community will ensure on-going interoperation and quality by following standards for the metadata that describe resources and services and for processes that review them."

(From http://www.language-archives.org/OLAC/process.html)

Migrating from OAi to OLAC

The implementer of an OLAC data provider must implement the OAi protocol plus three additional features:

support the OAi format for unique identifiers of records
supply an OLAC-specific archive description
support the OLAC-specific metadata standard
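Example (a minimal sketch): from a harvester's point of view, the three additions show up in the requests it sends. The code below only builds request URLs and sends nothing over the network. The base URL, repository name and local identifier are invented, and the helper names `oai_identifier` and `oai_request` are hypothetical. The verbs and the `oai:<repository>:<local id>` identifier layout follow the OAI protocol documents as far as I know them, and `olac` is assumed here as the metadata prefix of the OLAC format.

```python
from urllib.parse import urlencode

# Hypothetical data provider; replace with a real base URL
BASE_URL = "http://archive.example.org/oai"


def oai_identifier(repository_id, local_id):
    """Compose a record identifier in the OAi identifier format."""
    return f"oai:{repository_id}:{local_id}"


def oai_request(base_url, verb, **params):
    """Build the URL for one protocol request (nothing is sent)."""
    query = urlencode({"verb": verb, **params})
    return f"{base_url}?{query}"


# 1. Unique identifiers in the OAi format
item_id = oai_identifier("archive.example.org", "item-42")

# 2. The archive description is part of the answer to Identify
identify_url = oai_request(BASE_URL, "Identify")

# 3. Records are requested in the OLAC metadata format
list_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="olac")
get_url = oai_request(BASE_URL, "GetRecord",
                      identifier=item_id, metadataPrefix="olac")

for url in (identify_url, list_url, get_url):
    print(url)
```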

OLAC Metadata Set

(From http://www.language-archives.org/OLAC/olacms.html)

Title
Creator
Subject
Subject.language: A language which the content of the resource describes or discusses.
Description
Publisher
Contributor
Date
Type
Type.functionality: Software Functionality
Type.linguistic: The nature or genre of the content of the resource from a linguistic standpoint.
Format
Format.cpu: The CPU required to use a software resource.
Format.encoding: An encoded character set used by a digital resource.
Format.markup: A markup scheme used by a digital resource.
Format.os: An operating system required to use a software resource.
Format.sourcecode: A programming language of software distributed in source form.
Identifier
Source
Language
Relation
Coverage
Rights
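Example (a sketch, assuming the dotted names above are handled as plain strings): each refinement such as Subject.language narrows one of the fifteen Dublin Core elements. A service that only understands plain Dublin Core can still use an OLAC record by dropping the refinement, often called "dumbing down". The function names and record values below are hypothetical.

```python
# The fifteen Dublin Core elements that OLAC builds on
DC_ELEMENTS = {
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
}


def split_qualified(name):
    """Split 'Format.encoding' into ('Format', 'encoding')."""
    base, _, refinement = name.partition(".")
    if base not in DC_ELEMENTS:
        raise ValueError(f"not a Dublin Core element: {base!r}")
    return base, (refinement or None)


def dumb_down(record):
    """Reduce every (qualified name, value) pair to its base element."""
    return [(split_qualified(name)[0], value) for name, value in record]


# Invented record, for illustration only
record = [
    ("Title", "Example resource"),
    ("Subject.language", "example language"),
    ("Format.encoding", "UTF-8"),
    ("Format.os", "example OS"),
]

simple = dumb_down(record)
for element, value in simple:
    print(element, "=", value)
```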

Language Typology Resource Center (LTRC)

Language Typology Resource Center

http://www-uilots.let.uu.nl/td/LTRC/

The Language Typology Resource Center is a project funded by the European Community. It consists of a "thematic network" carried out in the framework of the specific research and technological development programme "Improving the Human Research Potential and the Socio-Economic Knowledge Base".

Aim of the LTRC Project

The aim of the initiative is to create a web-accessible electronic archive for typological description, including powerful research tools such as typological databases, language-typological expert systems, extensive scientific grammars, and corpora.

Objectives of the LTRC project

to combine expertise available with the different partners in the research network in developing typological databases;
to collect relevant information on available databases and datasets in Europe;
to stimulate the conversion of a number of existing databases (digitised or otherwise) into standardised formats;
to disseminate expertise from the participants in setting up databases and solving technical and fundamental issues;
to encourage establishing standards that the linguistic typological community will adhere to in creating databases;
to acquire and develop the appropriate metadata set information, and appropriate software for accessing databases, either directly or through a meta database system as part of a Language Typology Resource Centre;
to make the results available to the linguistic community via a Language Typology Resource Centre Website.

Network Members

Netherlands Graduate School in Linguistics (LOT) (wwwlot.let.uu.nl)
Linguistics department, Max Planck Institute for Evolutionary Anthropology (www.eva.mpg.de/lingua/)
Institut für Englische Philologie, Freie Universität Berlin (http://www.fu-berlin.de/)
Department of Linguistics, Stockholm University (www.ling.su.se/)
Department of Linguistics and International studies, University of Surrey (www.surrey.ac.uk/LIS/)
Department of Computer Science, University of Lisbon (www.di.fc.ul.pt/)
Sprachbaupläne, University of Konstanz (ling.uni-konstanz.de/)
Department of Linguistics and Modern English Language, University of Lancaster (http://www.ling.lancs.ac.uk/)
Zentrum für Allgemeine Sprachwissenschaft Typologie und Universalienforschung (Berlin) (http://www.zas.gwz-berlin.de/)

Databases involved in the LTRC

Intensifiers/Emphatic reflexives (and reflexives) (Freie Universität Berlin) [XML metadata]
Surrey Syncretism Database (University of Surrey) [XML metadata]
Utrecht Linguistic Database, Aspect (University of Utrecht) [XML metadata]
Nexing Corpus (University of Lisbon/Coimbra) [XML metadata]
StressTyp (University of Utrecht) [XML metadata]
Person Agreement Database (University of Amsterdam / Lancaster) [XML metadata]
Spinoza Areal Database (Leiden University) [XML metadata]
The Intonational Lexicon: Contours and Alignment in Read Texts (Konstanz)
Database Sprachbaupläne/Universals Archive (Konstanz)
Dualis constructions database (Konstanz)
Number use in language: a quantitative and typological investigation (Surrey).
Anaphora database (Language in Use project) (Utrecht)
World Atlas of Language Structures (Leipzig)
Word order database (Amsterdam)
Evidentiality database (Amsterdam)
Comparison database (Nijmegen)
Intransitive predication database (Nijmegen)

Typological Database Project (TDP)

Typological Database Project

http://www-uilots.let.uu.nl/td/

The Typological Database Project (TDP) is funded by the Netherlands Organisation for Scientific Research (NWO) and, like LTRC, is chaired by the Netherlands Graduate School of Linguistics (LOT) at Utrecht University.

The project is currently 'in between funding'. The pilot project ended last year, and although the renewal application got A-status, there was no money for new funding this year. Another reapplication is underway.

Partners

Utrecht institute of Linguistics OTS, Utrecht University (www-uilots.let.uu.nl)
Language and Cognition group, Max Planck Institute, Nijmegen (http://www.mpi.nl/)
Department of Linguistics, University of Amsterdam (http://www.hum.uva.nl/)
Department of Linguistics, Nijmegen University (atd.let.kun.nl/)
Department of Linguistics, Leiden University (www.leidenuniv.nl/let/)
Grammatical Models group, Tilburg University (http://www.kub.nl/)

Aim of the TDS project

Create a Typological Database System that allows the user

to access a large number of databases through one single interface
to combine the results in an intelligent way.
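Example (a toy sketch, not the TDS design): the two aims above amount to wrapping every database behind the same lookup function and merging the answers per language. The database contents, language codes and the function `query_all` are all invented for illustration; the real databases differ in software, coverage and theoretical tradition.

```python
# Two made-up "databases" keyed by invented language codes
STRESS_DB = {"lang-a": "initial", "lang-b": "final"}
WORD_ORDER_DB = {"lang-a": "SOV"}

# One uniform interface: each source is a function from language to answer
sources = {
    "stress": STRESS_DB.get,
    "word_order": WORD_ORDER_DB.get,
}


def query_all(sources, language):
    """Ask every source about one language and merge the answers."""
    merged = {}
    for name, lookup in sources.items():
        result = lookup(language)
        if result is not None:   # a source may simply not cover the language
            merged[name] = result
    return merged


print(query_all(sources, "lang-a"))
print(query_all(sources, "lang-b"))
```

Combining results "in an intelligent way" is the hard part: before answers can be merged like this, the terminology of the individual databases has to be mapped onto a shared vocabulary.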

Example

StressTyp

Contents: Stress systems
Coverage: about 500 languages from all over the world
Software: 4th Dimension software, on a server at Leiden University
Chomskyan tradition

Eurotype project: word order

Contents: Word order
Coverage: 150 European languages
Software: own database application in Pascal, working on a conversion to Microsoft Access. Not on-line.
Functional Grammar (Simon Dik)

TDS Architecture

Copyright: Typological Database Project / Utrecht University

Goals

Support and promote the development of typological databases, both phenomenon-specific ones and general ones.
Develop a linguistic metalanguage, that is, a list of relevant terminology, as well as a list of relevant phenomena.
Develop content metadata to describe the content and the format of the various databases participating in the project, that is, the relevant metadata elements and the appropriate vocabulary, to aim at a common standard.
Develop an appropriate user interface.

Summary

Metadata initiatives benefit from standards
- Character encoding
- Software encoding
- Linguistic terminology
XML seems suited for at least the metadata level

Needle, haystack ...?

Perhaps in the not too distant future, we will find that needle in the haystack?

Original from http://www.apax.com/images/v1/Haystack-FINALb.jpg