Unicode and UTF-8
Problems with Unicode and UTF-8 in Xierpa
During the development of Xierpa, unicode and UTF-8 strings were mixed inside the application. In practice this caused a lot of errors and confusion. Python does not allow combining unicode strings with UTF-8 encoded strings that contain non-ASCII characters. Instead, all strings to be merged have to be converted first, using unicode(utf8string, 'utf-8') (from UTF-8 to unicode) or unicodestring.encode('utf-8') (from unicode to UTF-8).
The strings must have the right format, otherwise an error is raised.
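As a minimal sketch of the two conversions (not Xierpa code; the variable names are illustrative), the snippet below uses utf8string.decode('utf-8'), which gives the same result as unicode(utf8string, 'utf-8') and also runs on Python 3, where the unicode type is called str:

```python
# Illustrative sketch only; the variable names are hypothetical, not Xierpa code.

# A UTF-8 encoded byte string, as it might arrive from a file or a form.
utf8string = b'K\xc3\xb8benhavn'

# From UTF-8 to unicode (same result as unicode(utf8string, 'utf-8') in Python 2).
unicodestring = utf8string.decode('utf-8')

# From unicode back to UTF-8.
roundtrip = unicodestring.encode('utf-8')

# Mixing the two types fails: Python 2 raises UnicodeDecodeError for
# non-ASCII bytes, Python 3 raises TypeError.
try:
    merged = unicodestring + utf8string
    mixing_failed = False
except (UnicodeDecodeError, TypeError):
    mixing_failed = True
```

The unicode string holds 9 characters, while its UTF-8 form holds 10 bytes, because the ø is encoded as two bytes.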
To avoid all this, Xierpa needs to be UTF-8 clean: all internal strings must be unicode. Strings are encoded to UTF-8 only at peripheral connections, for use outside the webserver, such as XHTML code, database connections, files and form input. Xierpa performs most of these peripheral conversions automatically.
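The rule "unicode inside, UTF-8 at the edges" can be sketched as follows. The helpers to_unicode and to_utf8 are hypothetical names chosen for illustration; they are not Xierpa functions, and Xierpa's own peripheral conversions may work differently.

```python
# Hypothetical helpers, not part of Xierpa: decode on the way in,
# encode on the way out, and use only unicode in between.

def to_unicode(value, encoding='utf-8'):
    """Return value as unicode, decoding byte strings that come from outside."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def to_utf8(value, encoding='utf-8'):
    """Return value as an encoded byte string for use outside the application."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

form_input = b'K\xc3\xb8benhavn'   # bytes arriving from outside
city = to_unicode(form_input)      # unicode inside the application
greeting = u'Hello ' + city        # safe: both operands are unicode
output = to_utf8(greeting)         # bytes leaving the application
```

Because every internal value is unicode, the concatenation in the middle never has to guess an encoding.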
Warning: The original transformer tool unicodify no longer exists.
Aspects to note
- Python sources and XML documents should be of type “Unix, UTF-8, No BOM”.
- In Python source, strings can be written as either "abc" or u"abc" if they contain only ASCII characters. As soon as a string contains any other Unicode character, it must be of type unicode, as in u"København".
- All conversions to and from a database flow through the tools/transformer functions
sql2unicode(value) and s2sql(value). Note that these functions are not
symmetric. The reason is that reading a selection from the database generates a set of records,
while writing is always based on a (UTF-8) query string.
There is still a bug in the Record reading for related field values.
sql2unicode(v): Converts v to a unicode string if it is an instance of basestring, and replaces every doubled single quote ('') by a single quote. This is done to prevent any hacking of queries.
s2sql(v): Does not convert v to a UTF-8 string; that is done for the complete query by self.agent. If the value is not an instance of basestring, it is converted to a string first. Then every single quote is replaced by a doubled single quote (''). This is done to prevent any hacking of queries.
- The XsltParser.xml() method only takes unicode strings. An error will be raised if the xml attribute has a different type.
- The XmlParser.parse() method only takes unicode strings. An error will be raised if the xml attribute has a different type.
- The peripheral communication methods Agent.query(q) and Agent.getquery(q) convert the unicode query attribute q to UTF-8 before calling the database driver.
- The Mailer.mailto instance (as called from XierpaBuilder.mailto) requires the subject and message attributes both to be of type unicode; otherwise an error is raised.
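The behaviour described above for sql2unicode and s2sql can be sketched as below. This is a reconstruction from the description only, not the actual tools/transformer source. The names string_types and text_type are helper aliases introduced here so the sketch runs on both Python 2 and Python 3, and decoding byte strings as UTF-8 is an assumption.

```python
# Sketch reconstructed from the description; not the actual Xierpa source.
try:
    string_types = basestring        # Python 2: covers str and unicode
    text_type = unicode
except NameError:
    string_types = (str, bytes)      # Python 3 stand-in for basestring
    text_type = str

def sql2unicode(value):
    """Value read from the database: make strings unicode and turn '' into '."""
    if isinstance(value, string_types):
        if not isinstance(value, text_type):
            value = value.decode('utf-8')   # assumption: byte strings are UTF-8
        value = value.replace(u"''", u"'")
    return value                             # non-strings pass through unchanged

def s2sql(value):
    """Value going into a query: make it a string and double every single quote.

    No UTF-8 encoding happens here; the complete query is encoded later.
    """
    if not isinstance(value, string_types):
        value = text_type(value)
    elif not isinstance(value, text_type):
        value = value.decode('utf-8')        # assumption: byte strings are UTF-8
    return value.replace(u"'", u"''")

escaped = s2sql(u"O'Brien")
restored = sql2unicode(escaped)
```

The asymmetry mentioned above is visible here: s2sql(42) returns the string u'42', while sql2unicode(42) returns the integer unchanged.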
