Configuring WWW Server for ISO 8859-2

N.B. This text borrows a lot from Mark Martinec's recommendations (in Slovenian) on usage of Slovenian non-ASCII characters in HTML documents.

Principles

  1. The HTTP specification [1] requires the HTTP transport protocol to transmit a data stream without restricting it to ASCII or printable characters. This includes 8-bit characters, 16-bit characters (in the case of ISO 10646 or Eastern languages), images, movies, sound, etc.
  2. The content of a document is described by its MIME header (the Content-Type line). This is the only information the browser gets, so it must reflect the true content of the document.
  3. The default character encoding (MIME/HTTP terminology uses the term "character set") for HTML text files is ISO 8859-1. Every WWW browser must be capable of displaying documents encoded in ISO 8859-1. The MIME header for HTML documents encoded in ISO 8859-1 is
    Content-Type: text/html
    
  4. The HTTP specification allows a character encoding other than the default to be specified using the charset token in the Content-Type line of the MIME header. Any character set registered in the IANA Character Set registry [2] is a valid option. For practical purposes, the HTTP specification defines the following set of names that are most likely to be used with HTTP entities:
    charset = "US-ASCII"
            | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
            | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
            | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
            | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
            | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
            | token
    
    Any token that has a predefined value within the IANA Character Set registry must represent the character set defined by that registry. An empty token implicitly assumes ISO 8859-1 (or US-ASCII, which is a subset of it).

    For ISO 8859-2, for example, the header must contain:

    Content-Type: text/html; charset=ISO-8859-2
    
    The space between the semicolon and charset=ISO-8859-2 is optional.
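
The rules in points 3 and 4 can be illustrated from the receiving side. The following is only a minimal sketch in Python, not part of any server or browser discussed here; the helper name charset_of is hypothetical, and the parsing is delegated to the standard library's email package.

```python
from email.message import Message

# Default for text/html when no charset is given (see point 3 above).
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the lower-cased charset named in a Content-Type value.

    charset_of is a hypothetical helper used for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the fallback when no charset is present.
    return msg.get_content_charset(DEFAULT_CHARSET)


if __name__ == "__main__":
    for value in ("text/html",
                  "text/html; charset=ISO-8859-2",
                  "text/html;charset=ISO-8859-2"):
        print(value, "->", charset_of(value))
```

The last two values differ only in the optional space after the semicolon and yield the same charset.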

Practical Considerations

When is the use of ISO 8859-2 encoding advisable, and when is it not? Within an intranet (an in-house information system) in an environment where ISO 8859-2 encoding is used, one may freely choose to encode all documents in ISO 8859-2. If the intended audience is broader, however, one has to realize that ISO 8859-2 encoding is seldom used outside Central and Eastern Europe, and that the current HTTP specification does not require a browser to be capable of displaying a document encoded in ISO 8859-2. When unable to show the document in the requested encoding, a browser can either ignore the problem and show it encoded incorrectly, or refuse to show it and offer to save it to disk. It may therefore be advisable to avoid any non-ISO 8859-1 characters on the "entry point" of your information system.

Implementation Details

Several methods of serving documents in a national character set using the ISO 8859-2 encoding are reviewed below.

Defining a New Suffix for Static Documents

Upon a document request, most WWW servers deduce the content of a static document from the filename suffix, and build the appropriate MIME header for the outbound document. Several suffixes can denote the same document type (e.g. .jpeg and .jpg both denote image/jpeg), but not vice versa.

There is no general way to use the same suffix .html for documents encoded in both ISO 8859-1 and ISO 8859-2 and expect the server to somehow decide which is which. One possibility is to tag every HTML document with Content-Type: text/html; charset=ISO-8859-2. This method, however, has obvious disadvantages for an audience unable to display this character set.

The solution proposed here is to define an additional file suffix and bind it to the document type text/html; charset=ISO-8859-2. This option is supported by most WWW servers. Both the CERN and NCSA httpd servers use the AddType command in the configuration file to bind additional content types to new filename suffixes. Using this simple scheme we can provide appropriate MIME headers for documents written in encodings other than ISO 8859-1. For instance, if we decide to use the suffix .html-l2 for textual documents encoded according to ISO 8859-2, we can do it like this:

W3C httpd server (a.k.a. CERN); httpd.conf:
AddType  .html-l2     text/html;charset=ISO-8859-2  8bit 1.0
Apache httpd, NCSA httpd; srm.conf:
AddType  text/html;charset=ISO-8859-2	.html-l2
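
To make the suffix binding concrete, here is a rough Python sketch of the kind of lookup a server performs after reading such AddType lines. It is an illustration only, not the code of any of the servers named above; the table SUFFIX_TYPES, the function content_type_for and the fallback type are all assumptions.

```python
import os

# Hypothetical suffix table mirroring the AddType lines above.
SUFFIX_TYPES = {
    ".html": "text/html",                            # default, ISO 8859-1
    ".html-l2": "text/html; charset=ISO-8859-2",     # the new suffix
}


def content_type_for(filename: str) -> str:
    """Pick the Content-Type value from the filename suffix."""
    _, suffix = os.path.splitext(filename)
    # The fallback type is an assumption made for this sketch.
    return SUFFIX_TYPES.get(suffix.lower(), "application/octet-stream")


if __name__ == "__main__":
    for name in ("index.html", "index.html-l2"):
        print(name, "->", content_type_for(name))
```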


On-the-fly Re-encoding

On any WWW server supporting the Common Gateway Interface (CGI), one can implement on-the-fly re-encoding of documents. Besides its advantage (serving many encodings from a single document), the method also has drawbacks: processing takes CPU time on the server side, and CGI documents are usually not cached by proxies.

Example scripts for on-the-fly re-encoding:
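
As an illustration only (this is not one of the scripts originally referenced here), the core of such a CGI program could be sketched in Python as follows. The function name reencode, the choice of ISO 8859-2 as the stored encoding, and the substitution of "?" for characters missing from the target encoding are assumptions of this sketch.

```python
def reencode(document: bytes, target: str, source: str = "iso-8859-2") -> bytes:
    """Return a CGI response (header, blank line, body) in the target encoding.

    reencode is a hypothetical helper; a real script would also have to
    decide which target encoding the client asked for.
    """
    text = document.decode(source)
    # Characters that do not exist in the target encoding become "?".
    body = text.encode(target, errors="replace")
    header = f"Content-Type: text/html; charset={target}\r\n\r\n"
    return header.encode("ascii") + body


if __name__ == "__main__":
    import sys

    stored = "<H1>Cze\u015b\u0107 \u015bwiat</H1>\n".encode("iso-8859-2")
    # A CGI program writes its response to standard output.
    sys.stdout.buffer.write(reencode(stored, "windows-1250"))
```

The sketch shows why the method costs CPU time: every request decodes and re-encodes the whole document.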

The following solutions are platform-dependent. Most are bound to particular httpd servers, and the last one to the Netscape Navigator client. Another proposal for solving this problem is the "MIME Header Supplemented File Type" draft [3].

Meta-Information for W3C WWW Server

The W3C WWW server (formerly known as the CERN WWW server) makes it possible to add meta-information to the MIME headers of outbound documents.

If we accept the default setup, httpd expects meta-information files in the .web subdirectory, ending with the .meta suffix. For example, let us assume that our documents are stored in the /WWW/Hypertext directory and that we want to tag the document isolatin2.html with the proper MIME header.

First, we create the appropriate subdirectory (if it does not already exist):

% mkdir /WWW/Hypertext/.web

Then, we create a new file in this subdirectory and name it isolatin2.html.meta. This file must contain RFC822-style headers. In our case, it contains one single line:

Content-Type: text/html; charset=ISO-8859-2

That's all!
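
When many documents need tagging, the two manual steps above can be scripted. The following Python sketch assumes the default setup described above (a .web subdirectory and a .meta suffix); the function name write_meta is hypothetical.

```python
from pathlib import Path


def write_meta(document: Path, charset: str = "ISO-8859-2") -> Path:
    """Create .web/<name>.meta next to the document and return its path.

    Assumes the default httpd setup described above; write_meta is a
    hypothetical helper, not part of the W3C server distribution.
    """
    meta_dir = document.parent / ".web"
    meta_dir.mkdir(exist_ok=True)  # same effect as the mkdir step above
    meta_file = meta_dir / (document.name + ".meta")
    # The meta file holds a single RFC822-style header line.
    meta_file.write_text(f"Content-Type: text/html; charset={charset}\n",
                         encoding="ascii")
    return meta_file
```

Calling write_meta(Path("/WWW/Hypertext/isolatin2.html")) would produce /WWW/Hypertext/.web/isolatin2.html.meta, matching the example above.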

Apache ASIS Files

The Apache HTTP server defines a special file type called ASIS. The server sends ASIS ("as-is") files directly to the client without adding HTTP headers, so they are expected to contain all the fields required by the HTTP protocol, followed by a blank line and the HTML document content.

In the server configuration file, define a new MIME type called httpd/send-as-is, e.g.

AddType httpd/send-as-is asis

This binds files with the .asis suffix to the httpd/send-as-is MIME type. An ASIS file might look like

Status: 200 OK
Content-Type: text/html; charset=ISO-8859-2

<HTML>
<HEAD>
<TITLE>Hello world</TITLE>
</HEAD>
<BODY>
<H1>Cześć świat</H1>
</BODY>
</HTML>

As you can see, we must not forget the three-digit HTTP response code. The server always adds the Date: and Server: HTTP fields to the data returned to the client, so these should not be included in the ASIS file.
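
An ASIS file like the one above can be produced from an existing HTML text with a few lines of Python. This is a minimal sketch: the function name make_asis is hypothetical, and it assumes that the body should be stored in the same encoding that the Content-Type line announces.

```python
def make_asis(html: str, charset: str = "ISO-8859-2") -> bytes:
    """Build the bytes of an ASIS file: header fields, blank line, body.

    make_asis is a hypothetical helper. Date: and Server: are left out
    on purpose, since the server adds them itself.
    """
    head = ("Status: 200 OK\n"
            f"Content-Type: text/html; charset={charset}\n"
            "\n")
    # The body is encoded in the charset named in the header.
    return head.encode("ascii") + html.encode(charset)


if __name__ == "__main__":
    page = "<HTML><BODY><H1>Cze\u015b\u0107 \u015bwiat</H1></BODY></HTML>\n"
    with open("hello.asis", "wb") as out:  # example file name
        out.write(make_asis(page))
```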

Please also note that we cannot use the &#nnn; form, since this notation refers to the character with code nnn in the ISO 8859-1 character set rather than in the currently selected character set. The proposed HTML Internationalization draft [4] extends this notation to Unicode, so some day we will be able to refer to the lowercase letter c with caron as &#269;, for instance.
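
The numbers involved can be checked with a short Python snippet. It is only an illustration of the code values; the variable names are arbitrary.

```python
# Lowercase letter c with caron.
letter = "\u010d"

# Its Unicode code point, i.e. the number used in the &#269; notation.
code_point = ord(letter)

# Its single-byte code in ISO 8859-2.
latin2_byte = letter.encode("iso-8859-2")

# The same byte value read as ISO 8859-1 is a different letter
# (e with grave accent), which is why &#232; cannot stand for it.
latin1_reading = latin2_byte.decode("iso-8859-1")

print(code_point, latin2_byte, latin1_reading)
```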

Alis HTTP Server

Alis Technologies, Inc. has modified the NCSA HTTP server code and added some support for encodings other than ISO 8859-1. The two features added are the transmission of the charset parameter with the Content-Type field and the implementation of the Accept-Language protocol. Their approach was to add an SGML-style comment containing meta-information at the beginning of the HTML document.

An example document might look like:

<!--Des_champs_pseudo-MIME_suivent
Content-Type: text/html; charset=ISO-8859-1
Content-Language: fr
-->
<HTML>
<HEAD>
. . .
</BODY>
</HTML>

The modified server code is available on the Alis FTP site.
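
How the Alis server actually reads this comment is not described here. Purely as an illustration of the idea, the pseudo-MIME fields of such a document could be extracted in Python like this; the function name pseudo_mime_fields and the assumption that the comment is the first thing in the file are mine.

```python
import re
from email.parser import HeaderParser

# Leading comment: marker on the first line, header fields up to "-->".
_COMMENT = re.compile(r"\s*<!--\S*\r?\n(.*?)-->", re.DOTALL)


def pseudo_mime_fields(html: str) -> dict:
    """Return the header fields found in a leading pseudo-MIME comment.

    A hypothetical helper for illustration; it does not reproduce the
    Alis implementation.
    """
    match = _COMMENT.match(html)
    if match is None:
        return {}
    # The comment body is parsed as ordinary RFC822-style headers.
    return dict(HeaderParser().parsestr(match.group(1)).items())
```

For the example document above, this would yield the Content-Type and Content-Language fields, which a server could then copy into the MIME header.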

<META> Tag with HTTP-EQUIV Attribute

The HTML 2.0 specification [5] has proposed the <META> tag with the HTTP-EQUIV attribute as a non-mandatory method by which the server can extract meta-information from the document <HEAD> to generate MIME header fields. To my knowledge, no server implements this feature yet. Instead, some clients (the only one known to me is Netscape Navigator 2.0) have started to use this field.

A document enhanced with meta-information looks like

<HTML>
<HEAD>
  <TITLE>Your title here</TITLE>
  <META HTTP-EQUIV="Content-Type"
      CONTENT="text/html; charset=ISO-8859-2">
</HEAD>

<BODY>
Your text here...

That's it! If you are running Netscape Navigator 2.0 or later, you can select "Document Info" in the "View" pull-down menu to check the document encoding.
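
What a client (or a server implementing the proposal) has to do with such a tag can be sketched in Python using the standard html.parser module. This is an assumption-laden illustration, not a description of how Netscape Navigator works; the names MetaCharsetFinder and meta_charset are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset from <META HTTP-EQUIV="Content-Type" ...>."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        attrs = dict(attrs)
        if tag != "meta":
            return
        if (attrs.get("http-equiv") or "").lower() != "content-type":
            return
        msg = Message()
        msg["Content-Type"] = attrs.get("content") or ""
        self.charset = msg.get_content_charset()


def meta_charset(html: str):
    """Return the lower-cased charset declared in the document, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    finder.close()
    return finder.charset
```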

References:

[1] T. Berners-Lee, R. Fielding, H. Frystyk, Hypertext Transfer Protocol - HTTP/1.0, RFC 1945, February 1996.
[2] J. Reynolds and J. Postel, Assigned Numbers, STD 2, RFC 1700, USC/ISI, October 1994.
[3] G. Nicol, MIME Header Supplemented File Type, Internet-Draft, October 1995.
[4] F. Yergeau, G. Nicol, G. Adams and M. Duerst, Internationalization of the Hypertext Markup Language, RFC 2070, January 1997 (obsoleted by D. Connolly, L. Masinter, The 'text/html' Media Type, RFC 2854, June 2000).
[5] T. Berners-Lee, D. Connolly, Hypertext Markup Language - 2.0, RFC 1866, MIT/W3C, November 1995.
[6] B. Girschweiler, D. Connolly and B. Bos, World-wide Character Sets, Languages, and Writing Systems.
[7] F. Yergeau, Create your own multilingual Web site.
[8] A. Daviel, How to make a Multilingual Web server.

Created 1996-02-24 by P. Peterlin
Last revision $Date: 2001/02/20 20:01:11 $ ($Author: gnusl $)