N.B. This text borrows a lot from Mark Martinec's recommendations (in Slovenian) on usage of Slovenian non-ASCII characters in HTML documents.
Content-Type line). This is the only information the
browser gets, and it must reflect the true content of the document.
Content-Type: text/html
charset token in the
Content-Type line of the MIME header. Any character set
registered in the IANA Character Set registry [2]
is a valid option. For practical purposes, the HTTP
specification defines the following set of names that are most
likely to be used with HTTP entities:
charset = "US-ASCII"
| "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
| "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
| "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
| "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
| "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
| token
Any token that has a predefined value within the IANA Character Set
registry must represent the character set defined by that
registry. An empty token implicitly means ISO 8859-1 (or
US-ASCII, which is a subset of it).
For ISO 8859-2, for example, the header must contain:
Content-Type: text/html; charset=ISO-8859-2
The space between the semicolon and
charset=ISO-8859-2 is
optional.
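To illustrate how a client might read the charset parameter, here is a minimal Python sketch. It borrows the standard library's MIME header parser. The function name charset_of and the lowercase return value are choices made for this example, not part of any specification. The fallback to ISO 8859-1 mirrors the default described above.

```python
from email.message import Message


def charset_of(content_type: str, default: str = "iso-8859-1") -> str:
    """Return the charset named in a Content-Type value, lowercased.

    Falls back to `default` when no charset parameter is present.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when the parameter is missing.
    return msg.get_content_charset() or default


print(charset_of("text/html; charset=ISO-8859-2"))  # with the optional space
print(charset_of("text/html;charset=ISO-8859-2"))   # without it
print(charset_of("text/html"))                      # no charset given
```

The first two calls yield the same result, which shows that the space after the semicolon does not matter to a conforming parser.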
When is the use of ISO 8859-2 encoding advisable, and when is it not? Within an intranet (the in-house information system) in an environment where ISO 8859-2 encoding is used, one might freely choose to encode all documents in ISO 8859-2. If the intended audience is broader, however, one has to realize that ISO 8859-2 encoding is seldom used outside Central and Eastern Europe, and that the current HTTP specification does not require a browser to be capable of displaying a document encoded in ISO 8859-2. When unable to show the document in the requested encoding, a browser can either ignore the problem and show it encoded incorrectly, or refuse to show it and offer to save it to disk. It may therefore be advisable to avoid any non-ISO 8859-1 characters on the "entry point" of your information system.
Several methods of serving national characters in ISO 8859-2 encoding are reviewed below:
Upon a document request, most WWW servers deduce the content type
of a static document from the filename suffix and build the
appropriate MIME header for the outbound document. Several
suffixes can denote the same document type (e.g.
.jpeg and .jpg both denote
image/jpeg), but not vice versa.
There is no general way to use
the same suffix .html for documents encoded in
both ISO 8859-1 and ISO 8859-2 and expect the server to somehow
decide which is which. One possibility is to tag every
HTML document with
Content-Type: text/html; charset=ISO-8859-2. This
method, however, has obvious disadvantages for an audience unable to
display this character set.
The solution proposed here is to define
an additional file suffix and bind it to the document type text/html;
charset=ISO-8859-2. This option is supported by most
WWW servers.
Both the CERN and NCSA httpd servers use the
AddType
directive in the configuration file to bind additional content types to
new filename suffixes. Using a
clever yet simple
scheme, we can provide appropriate MIME headers for
documents written in encodings other than ISO 8859-1.
For instance, if we decide to use the suffix .html-l2
for textual documents encoded according to ISO 8859-2, we can
configure it as follows:
httpd.conf:
AddType .html-l2 text/html;charset=ISO-8859-2 8bit 1.0
srm.conf:
AddType text/html;charset=ISO-8859-2 .html-l2
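The suffix-to-type binding that AddType sets up can be pictured as a simple lookup table. The following Python sketch is only an illustration of that idea, not how either server is implemented. The table contents, the function name content_type_for, and the fallback type are assumptions made for the example.

```python
import os.path

# Hypothetical suffix table, playing the role of the AddType lines.
SUFFIX_TYPES = {
    ".html": "text/html",
    ".html-l2": "text/html; charset=ISO-8859-2",
    ".jpeg": "image/jpeg",
    ".jpg": "image/jpeg",  # several suffixes may share one type
}


def content_type_for(filename: str, default: str = "application/octet-stream") -> str:
    """Pick the Content-Type value from the text after the last dot."""
    _, suffix = os.path.splitext(filename)
    return SUFFIX_TYPES.get(suffix.lower(), default)


print(content_type_for("isolatin2.html-l2"))
print(content_type_for("index.html"))
```

Because the lookup goes from suffix to type, one suffix can never stand for two encodings, which is why a separate suffix such as .html-l2 is needed.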
On any WWW server supporting the Common Gateway Interface (CGI) one can implement on-the-fly re-encoding of the documents. Besides its good points (providing many encodings from a single document), the method also has its drawbacks - processing takes CPU time on the server side, and CGI documents are usually not cached by proxies.
Example scripts for on-the-fly re-encoding:
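As a rough illustration of the idea (this is not one of the scripts referred to above), the core of such a re-encoding step might look like the following Python sketch. The function names reencode and cgi_response are hypothetical. The sketch assumes the stored document is in ISO 8859-2 and replaces characters that the target character set cannot represent with a question mark.

```python
def reencode(body: bytes, target: str, source: str = "iso-8859-2") -> bytes:
    """Decode the stored document and encode it in the requested charset."""
    text = body.decode(source)
    # Characters missing from the target charset become "?".
    return text.encode(target, errors="replace")


def cgi_response(body: bytes, target: str) -> bytes:
    """Build CGI output: a Content-Type header, a blank line, then the body."""
    header = f"Content-Type: text/html; charset={target}\r\n\r\n"
    return header.encode("ascii") + reencode(body, target)


stored = "<H1>Cześć świat</H1>".encode("iso-8859-2")
print(cgi_response(stored, "utf-8"))
print(cgi_response(stored, "us-ascii"))
```

The decoding and encoding happen on every request, which is the CPU cost mentioned above.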
The following three solutions are platform-dependent. The first two are bound to two popular httpd servers, and the last one to the Netscape Navigator client. Another proposed solution to this problem is the "MIME Header Supplemented File Type" draft [3].
The W3C WWW server (formerly known as the CERN WWW server) makes it possible to add meta-information to the MIME headers of outbound documents.
If we accept the default setup, httpd expects
meta-information files in the .web subdirectory,
ending with the .meta suffix. For example, let us assume
that we have our documents stored in the /WWW/Hypertext
directory and we want to tag the document
isolatin2.html with the proper MIME header.
First, we create the appropriate subdirectory (if it does not already exist):
% mkdir /WWW/Hypertext/.web
Then, we create a new file in this subdirectory and name it
isolatin2.html.meta. This file must contain
RFC822-style headers. In our case, it contains a single
line:
Content-Type: text/html; charset=ISO-8859-2
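When many documents need tagging, the two steps above can be scripted. The following Python sketch assumes the default setup described here (a .web subdirectory and the .meta suffix). The function name write_meta is hypothetical.

```python
from pathlib import Path


def write_meta(document: Path, charset: str = "ISO-8859-2") -> Path:
    """Create .web/<name>.meta next to `document` and return its path."""
    meta_dir = document.parent / ".web"
    meta_dir.mkdir(exist_ok=True)  # same effect as the mkdir step above
    meta_file = meta_dir / (document.name + ".meta")
    meta_file.write_text(
        f"Content-Type: text/html; charset={charset}\n", encoding="ascii"
    )
    return meta_file


# Example call (not executed here, since it writes to disk):
# write_meta(Path("/WWW/Hypertext/isolatin2.html"))
```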
That's all!
The Apache HTTP server supports a special file type called ASIS. The server sends ASIS ("as-is") files directly to the client, adding almost no HTTP headers of its own, so they are expected to contain all the fields required by the HTTP protocol, followed by a blank line and the HTML document content.
In the server configuration file, define a new MIME type
called httpd/send-as-is, e.g.
AddType httpd/send-as-is asis
This binds files with the
.asis suffix to the
httpd/send-as-is MIME type. An ASIS file might look
like
Status: 200 OK
Content-Type: text/html; charset=ISO-8859-2

<HTML>
<HEAD>
<TITLE>Hello world</TITLE>
</HEAD>
<BODY>
<H1>Cześć świat</H1>
</BODY>
</HTML>
As you can see, we must not forget the three-digit HTTP
response code. The server always adds the Date: and
Server: HTTP fields to the data returned to the
client, so these should not be included in the ASIS
file.
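An ASIS file can also be generated from an existing HTML document. The Python sketch below is one possible way to do it, with the hypothetical helper name build_asis. It writes the Status and Content-Type fields, a blank line, and then the body encoded in the declared character set. It leaves out Date: and Server:, in line with the remark above.

```python
def build_asis(html: str, charset: str = "ISO-8859-2", status: str = "200 OK") -> bytes:
    """Return the bytes of an ASIS file: header fields, blank line, body."""
    header = (
        f"Status: {status}\n"
        f"Content-Type: text/html; charset={charset}\n"
        "\n"  # the blank line separates header fields from the document
    )
    # The body must actually be encoded in the charset the header declares.
    return header.encode("ascii") + html.encode(charset)


data = build_asis("<HTML><BODY><H1>Cześć świat</H1></BODY></HTML>")
print(data)
```

Encoding the body with the same name that goes into the header keeps the two from drifting apart.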
Please also note that we cannot use the
&#nnn; form, since this
notation refers to the nnn-th glyph in the ISO 8859-1 character
set rather than in the currently selected character set. The
proposed HTML Internationalization draft
[4] extends this notation to Unicode,
so some day we will be able to refer to the lowercase letter c with
caron as &#269;, for instance.
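The numbers involved can be checked with a few lines of Python. This is only an illustration. The variable names are made up for the example.

```python
letter = "č"  # lowercase c with caron

# Position in Unicode: the number a Unicode-based &#nnn; would use.
unicode_number = ord(letter)

# Byte value of the same letter in ISO 8859-2.
latin2_byte = letter.encode("iso-8859-2")[0]

# What that byte value means when read as ISO 8859-1.
latin1_reading = bytes([latin2_byte]).decode("iso-8859-1")

print(unicode_number, latin2_byte, latin1_reading)  # 269 232 è
```

So writing &#232; in the hope of getting č would yield è under the ISO 8859-1 numbering.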
Alis Technologies, Inc. has modified the NCSA HTTP server code and added some support for character sets other than ISO 8859-1. The two features added are the transmission of the charset parameter with the Content-Type field and the implementation of the Accept-Language protocol. Their approach was to add an SGML-style header containing meta-information at the beginning of the HTML document.
An example document might look like:
<!--Des_champs_pseudo-MIME_suivent
Content-Type: text/html; charset=ISO-8859-1
Content-Language: fr
-->
<HTML>
<HEAD>
. . .
</BODY>
</HTML>
The modified server code is available on the Alis FTP site.
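To give a feel for what a server has to do with such a header, here is a small Python sketch that pulls the pseudo-MIME fields out of the leading comment. It is not the Alis code. It assumes that the comment opens the document, that its first line is the marker text, and that each field sits on its own line. The function name pseudo_mime_fields is hypothetical.

```python
def pseudo_mime_fields(document: str) -> dict:
    """Collect 'Name: value' lines from a leading <!-- ... --> comment."""
    fields = {}
    text = document.lstrip()
    if not text.startswith("<!--"):
        return fields
    comment, _, _ = text[4:].partition("-->")
    # Skip the first line, which holds the marker text, not a field.
    for line in comment.splitlines()[1:]:
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


sample = """<!--Des_champs_pseudo-MIME_suivent
Content-Type: text/html; charset=ISO-8859-1
Content-Language: fr
-->
<HTML>
"""
print(pseudo_mime_fields(sample))
```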
The HTML 2.0 specification [5] proposed the
<META> tag with the HTTP-EQUIV
attribute as an optional method by which the server can
extract meta-information from the document
<HEAD> to generate MIME header fields.
To date and to my knowledge, no server implements this feature.
Instead, some clients (the only one known to me is
Netscape Navigator 2.0)
have started to use this field.
A document enhanced with meta-information looks like
<HTML>
<HEAD>
<TITLE>Your title here</TITLE>
<META HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=ISO-8859-2">
</HEAD>
<BODY>
Your text here...
</BODY>
</HTML>
That's it! If you are running Netscape Navigator 2.0 or later, you can select "Document Info" in the "View" pull-down menu and check the document encoding.
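How a client or a server might pick the charset out of such a document can be sketched in Python with the standard HTML parser. This is an illustration only, not a description of what Netscape Navigator does internally. The class name MetaContentType and the function name meta_charset are hypothetical.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaContentType(HTMLParser):
    """Remember the CONTENT of the first <META HTTP-EQUIV="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.content_type = None

    def handle_starttag(self, tag, attrs):
        # The parser hands over tag and attribute names in lowercase.
        attributes = dict(attrs)
        equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and equiv == "content-type" and self.content_type is None:
            self.content_type = attributes.get("content")


def meta_charset(html: str):
    """Return the lowercased charset declared via <META>, or None."""
    parser = MetaContentType()
    parser.feed(html)
    if not parser.content_type:
        return None
    msg = Message()
    msg["Content-Type"] = parser.content_type
    return msg.get_content_charset()


page = """<HTML>
<HEAD>
<TITLE>Your title here</TITLE>
<META HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=ISO-8859-2">
</HEAD>
<BODY>
Your text here...
</BODY>
</HTML>"""
print(meta_charset(page))
```

Note that the tag itself consists of ASCII characters only, so it can be found before the document's encoding is known.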
Created 1996-02-24 by
P. Peterlin
Last revision $Date: 2001/02/20 20:01:11 $ ($Author: gnusl $)