UiL-OTS / LOT
Summary of the previous days
- Monday: XML ins and outs
- Tuesday: XML-related proposals
- Wednesday: XML software
- Thursday: Language encoding recommendations
- Today: Metadata
In the beginning ...
When the web was new and shiny, everything seemed to be brilliant, and it was
just a plain gold mine.
But then ...
But then so many people started piling their stuff on the web that it became
very hard to find what you were looking for.
The result ...
The result was a haystack in which every needle gets lost.
- Primary aim: increase accessibility of existing archives
- Secondary aim: promote the use of open standards
- Original beneficiaries: E-print archives
Metadata proposals & projects
- DCMI: Dublin Core Metadata Initiative (1995)
- OAi: Open Archives Initiative (1999)
- OLAC: Open Language Archives Community (2001)
- LTRC: Language Typology Resource Center (2002)
- TDP: Typological Database Project (2003?)
Dublin Core Metadata Initiative (DCMI)
Dublin Core Metadata Initiative (1)
The Dublin Core Metadata Initiative (dublincore.org) was founded during a joint
workshop of the National Center for Supercomputing Applications (NCSA) and the
Online Computer Library Center (OCLC) that was held in Dublin, Ohio, March 1995.
The premise was that a core set of semantics for Web-based resources would be
extremely useful for categorizing the Web for easier search and retrieval.
Dublin Core Metadata Initiative (2)
Quote from http://dublincore.org/about/:
"The Dublin Core Metadata Initiative (DCMI) is an organization
dedicated to promoting the widespread adoption of interoperable metadata
standards and developing specialized metadata vocabularies for describing
resources that enable more intelligent information discovery systems."
Dublin Core Specifications
There are four kinds of documents:
- DCMI Recommendations
- Specifications are stable and are supported for adoption.
- DCMI Proposed Recommendations
- Growing support for adoption; specifications are close to stable.
- DCMI Working Drafts
- For review.
- For discussion only.
Dublin Core Element Set (DCES)
- Title: A name given to the resource.
- Creator: An entity primarily responsible for making the content of the resource.
- Subject: The topic of the content of the resource.
- Description: An account of the content of the resource.
- Publisher: An entity responsible for making the resource available.
- Contributor: An entity responsible for making contributions to the content of the
resource.
- Date: A date associated with an event in the life cycle of the resource.
- Type: The nature or genre of the content of the resource.
- Format: The physical or digital manifestation of the resource.
- Identifier: An unambiguous reference to the resource within a given context.
- Source: A reference to a resource from which the present resource is derived.
- Language: A language of the intellectual content of the resource.
- Relation: A reference to a related resource.
- Coverage: The extent or scope of the content of the resource.
- Rights: Information about rights held in and over the resource.
- Label: Subject and Keywords
- Registration Authority: Dublin Core Metadata Initiative
- Definition: The topic of the content of the resource.
- Datatype: Character String
- Maximum Occurrence: Unlimited
- Comment: Typically, a Subject will be expressed as keywords, key phrases or
classification codes that describe a topic of the resource. Recommended best
practice is to select a value from a controlled vocabulary or formal
classification scheme.
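To make the element set concrete, here is a minimal sketch in Python that builds a small Dublin Core record as XML. The namespace URI is the one used for the Dublin Core elements; the `record` wrapper element, the `dc_record` helper and the field values are assumptions made up for illustration, not part of any specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 Dublin Core elements (version 1.1).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 element names of the Dublin Core Element Set.
DCES_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dc_record(fields):
    """Build an XML record from (element, value) pairs.

    'record' is a hypothetical wrapper element chosen for this sketch.
    """
    record = ET.Element("record")
    for name, value in fields:
        if name not in DCES_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Illustrative values only. Elements are optional and may repeat,
# which is why 'subject' appears twice.
fields = [
    ("title", "Metadata"),
    ("subject", "metadata"),
    ("subject", "Dublin Core"),
    ("language", "en"),
]
xml_text = ET.tostring(dc_record(fields), encoding="unicode")
print(xml_text)
```

Keeping the fields as a list of pairs rather than a dictionary is deliberate: it lets an element such as Subject occur more than once.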
Open Archives Initiative (OAi)
Open Archives Initiative (1)
The Open Archives Initiative (http://www.openarchives.org/) was
established in 1999 under the joint sponsorship of the:
- CLIR: Council on Library and Information Resources
- DLF: Digital Library Federation
- SPARC: Scholarly Publishing & Academic Resources Coalition
- ARL: Association of Research Libraries
- LANL: Los Alamos National Laboratory
Open Archives Initiative (2)
Quote from http://www.openarchives.org/organization/:
"The Open Archives Initiative develops and promotes
interoperability standards that aim to facilitate the efficient dissemination
of content. The Open Archives Initiative has its roots in an effort to enhance
access to e-print archives as a means of increasing the availability of
scholarly communication."
- A Celebration of Women Writers
- Academia Sinica Balanced Corpus of Modern Chinese
- Academia Sinica Formosan Language Archive
- Academia Sinica Tagged Corpus of Early Mandarin Chinese
- Ackerman Archives: Experimental File-based OAI Archive
- Alaska Native Language Center
- Alex Catalogue of Electronic Texts
- Archive Lyon 2
- Articles en ligne Jean Nicod
- ATILF Resources
- BioMed Central
- California Digital Library Repository 1
- California International and Area Studies Digital Repository
- Caltech Computer Science Technical Reports
- Caltech Earthquake Engineering Research Laboratory Technical Reports
- Caltech Electronic Theses and Dissertations
- Chemistry Preprint Server
- CIMI Metadata Harvesting Working Group Demonstration Repository
- Cognitive Science Data Archive
- Comparative Bantu Online Dictionary (CBOLD)
- Computer Science Teaching Center
- conoZe: intelligere ut credas, credere ut intelligas
- Dermatology Digital Repository
- Digital Library of the Commons
- DUETT - Dissertations and other Documents of the
- E-Numerate RDL Header Collection Prototype
- Elektronisches Dokumenten-, Archivierungs- und Retrievalsystem der
- Eprint Archive
- EPSILON EPrints2 Dissertation Test Archive
- ePub-WU OAI Archive (Vienna Univ. of Econ. and B.A.)
- ETD Individuals
- Ethnologue: Languages of the World
- Fourth International Symposium on Cavitation
- Groningen University Library
- Hochschulschriftenserver (HSSS) der SLUB Dresden
- Hong Kong University Theses Online
- Humboldt University of Berlin, GERMANY, Document Server
- Ibiblio Collection Index
- Arc (Old Dominion University)
- citebaseSearch (Southampton University)
- DP9 (Old Dominion University)
- NCSTRL (Old Dominion University, University of Virginia)
- OAIster (University of Michigan Libraries)
- Public Knowledge Harvester (U. of British Columbia)
- TORII (Trieste, Italy)
Open Language Archives Community (OLAC)
Open Language Archives Community
OLAC is an international partnership of institutions and individuals who are creating
a worldwide virtual library of language resources by:
- developing consensus on best current practice for the digital archiving of
language resources, and
- developing a network of interoperating repositories and services for
housing and accessing such resources.
"Any user on the Internet should be able to go to a single gateway to
find all the language resources available at all participating institutions,
whether the resources be data, tools, or advice. The community will ensure
on-going interoperation and quality by following standards for the metadata
that describe resources and services and for processes that review them."
Migrating from OAi to OLAC
The implementer of an OLAC data provider must implement the OAi protocol,
plus implement three additional features. The additions are:
- support the OAi format for unique identifiers of records
- supply an OLAC-specific archive description
- support the OLAC-specific metadata standard
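As a rough illustration of what these requirements mean on the wire, the Python sketch below builds a harvesting request and checks the shape of a record identifier. It assumes the OAI protocol's `ListRecords` verb with a `metadataPrefix` argument, that `olac` is the prefix naming the OLAC metadata format, and that identifiers look like `oai:<archive>:<local id>`. The base URL is a made-up example, the helper names are hypothetical, and the identifier check is a simplification of the real syntax.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix):
    """Build a request URL asking a data provider for all its records."""
    query = urlencode({"verb": "ListRecords",
                       "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def is_oai_identifier(identifier):
    """Simplified check for the 'oai:<archive>:<local id>' shape.

    The real identifier syntax is stricter; this only verifies the
    scheme and that both remaining parts are non-empty.
    """
    parts = identifier.split(":", 2)
    return len(parts) == 3 and parts[0] == "oai" and all(parts[1:])


# Hypothetical archive address, used for illustration only.
BASE_URL = "http://archive.example.org/oai"

print(list_records_url(BASE_URL, "oai_dc"))  # plain Dublin Core records
print(list_records_url(BASE_URL, "olac"))    # OLAC records (assumed prefix)
print(is_oai_identifier("oai:archive.example.org:item-1"))
```

A harvester would fetch such a URL and parse the XML response; that step is left out here because it depends on a live data provider.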
OLAC Metadata Set
- A language which the content of the resource describes or
- Software Functionality
- The nature or genre of the content of the resource from a linguistic
- The CPU required to use a software resource.
- An encoded character set used by a digital resource.
- A markup scheme used by a digital resource.
- An operating system required to use a software resource.
- A programming language of software distributed in source form.
Language Typology Resource Center (LTRC)
Language Typology Resource Center
The Language Typology Resource Center is a project that is funded by the
European Community, consisting of a "thematic network" to be carried out in the
framework of the specific research and technological development programme
"Improving the Human Research Potential and the Socio-Economic Knowledge Base".
Aim of the LTRC Project
The aim of the initiative is to create a web-accessible electronic archive for
typological description, including powerful research tools such as typological
databases, language-typological expert systems, extensive scientific grammars,
Objectives of the LTRC project
- to combine expertise available with the different partners in the research
network in developing typological databases;
- to collect relevant information on available databases and datasets in
- to stimulate the conversion of a number of existing databases (digitised
or otherwise) into standardised formats;
- to disseminate expertise from the participants in setting up databases and
solving technical and fundamental issues;
- to encourage establishing standards that the linguistic typological
community will adhere to in creating databases;
- to acquire and develop the appropriate metadata set information, and
appropriate software for accessing databases, either directly or through a
meta database system as part of a Language Typology Resource Centre;
- to make the results available to the linguistic community via a Language
Typology Resource Centre Website.
Databases involved in the LTRC
- Intensifiers/Emphatic reflexives (and reflexives) (Freie Universität
- Surrey Syncretism Database (University of Surrey) [XML metadata]
- Utrecht Linguistic Database, Aspect (University of Utrecht) [XML metadata]
- Nexing Corpus (University of Lisbon/Coimbra) [XML metadata]
- StressTyp (University of Utrecht) [XML metadata]
- Person Agreement Database (University of Amsterdam / Lancaster) [XML metadata]
- Spinoza Areal Database (Leiden University) [XML metadata]
- The Intonational Lexicon: Contours and Alignment in Read Texts (Konstanz)
- Database Sprachbaupläne/Universals Archive (Konstanz)
- Dualis constructions database (Konstanz)
- Number use in language: a quantitative and typological investigation
- Anaphora database (Language in Use project) (Utrecht)
- World Atlas of Language Structures (Leipzig)
- Word order database (Amsterdam)
- Evidentiality database (Amsterdam)
- Comparison database (Nijmegen)
- Intransitive predication database (Nijmegen)
Typological Database Project (TDP)
Typological Database Project
The Typological Database Project (TDP) is funded by the Netherlands
Organisation for Scientific Research (NWO) and, like the LTRC, is chaired by the
Netherlands Graduate School of Linguistics (LOT) at Utrecht University.
The project is currently 'in between funding'. The pilot project ended last
year, and although the renewal application got A-status, there was no money for
new funding this year. Another reapplication is underway.
Aim of the TDS project
Create a Typological Database System that allows the user
- to access a large number of databases through one single interface
- to combine the results in an intelligent way.
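The single-interface idea can be pictured with a small Python sketch. It is not the TDS design: the two in-memory "databases", their field names, the `query_all` helper and all values are placeholders invented for illustration. Each source keeps its own record layout, and a small adapter function maps it to a common shape before the answers are merged.

```python
def query_all(sources, language):
    """Ask every source about one language and merge the answers.

    'sources' maps a source name to (records, adapter), where the
    adapter turns a raw record into {'language': ..., 'value': ...}.
    """
    merged = {"language": language}
    for name, (records, to_common) in sources.items():
        for raw in records:
            common = to_common(raw)
            if common["language"] == language:
                merged[name] = common["value"]
    return merged


# Placeholder data: two sources with different field names.
stress_db = [{"lang": "Language A", "stress": "initial"}]
order_db = [{"name": "Language A", "order": "SVO"}]

sources = {
    "stress": (
        stress_db,
        lambda r: {"language": r["lang"], "value": r["stress"]},
    ),
    "word_order": (
        order_db,
        lambda r: {"language": r["name"], "value": r["order"]},
    ),
}

result = query_all(sources, "Language A")
print(result)
```

The adapters are where shared metadata and terminology would matter in practice: without agreement on what a field means, the merged answer cannot be trusted.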
- Contents: Stress systems
- Coverage: about 500 languages from all over the world
- Software: 4th Dimension software, on a server at Leiden University
- Chomskyan tradition
Eurotype project: word order
- Contents: Word order
- Coverage: 150 European languages
- Software: own database application in Pascal, working on a conversion to
Microsoft Access. Not on-line.
- Functional Grammar (Simon Dik)
- Support and promote the development of typological databases, both
phenomenon-specific ones and general ones.
- Develop a linguistic metalanguage, that is, a list of relevant
terminology, as well as a list of relevant phenomena.
- Develop content metadata to describe the content and the format of the
various databases participating in the project, that is, the relevant metadata
elements and the appropriate vocabulary, to aim at a common standard.
- Develop an appropriate user interface.
- Metadata initiatives benefit from standards
- Character encoding
- Software encoding
- Linguistic terminology
- XML seems suited for at least the metadata level
Needle, haystack ...?
Perhaps in the not too distant future, we will find that needle in the
haystack.