From -4079186349999320172 X-Google-Language: ENGLISH,ASCII X-Google-Thread: f78e5,9f2e16b466e08e7 X-Google-Attributes: gidf78e5,public X-Google-ArrivalTime: 2002-06-12 09:10:08 PST Path: archiver1.google.com!news2.google.com!news1.google.com!newsfeed.stanford.edu!logbridge.uoregon.edu!kibo.news.demon.net!news.demon.co.uk!demon!mail2news.demon.co.uk!not-for-mail From: kanze@gabi-soft.de (James Kanze) Newsgroups: comp.std.c++ Subject: Re: C++0x Wish list (Wide Char bit maps) Date: Wed, 12 Jun 2002 16:09:32 GMT Organization: http://groups.google.com/ Approved: Fergus Henderson , moderator of comp.std.c++ Message-ID: References: <86y9eaqbmb.fsf@alex.gabi-soft.de> <86adqjl0uv.fsf@alex.gabi-soft.de> <86y9e3nig9.fsf@alex.gabi-soft.de> <3D05CCA4.D7E9CE59@BoM.GOV.AU> X-Trace: mail2news.demon.co.uk 1023898177 mail2news:7573 mail2news mail2news.demon.co.uk X-Complaints-To: abuse@demon.net X-Mail2News-Path: news.demon.net!mulga.cs.mu.oz.au X-Authentication-Warning: mulga.cs.mu.OZ.AU: fjh set sender to devnull@stump.algebra.com using -f X-Robomod: STUMP, ichudov@algebra.com (Igor Chudov) Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 8bit NNTP-Posting-Date: 12 Jun 2002 13:12:25 GMT Lines: 74 Xref: archiver1.google.com comp.std.c++:11976 Charles Sanders wrote in message news:<3D05CCA4.D7E9CE59@BoM.GOV.AU>... > James Kanze wrote: > > > I have a number of cases where I represent sets of characters > > using a bit map. It is obvious that a bit map is NOT an > > appropriate representation when the domain contains over a > > million elements -- each bit map would require more that 128 K > > bytes. But what is the appropriate representation? For the > > moment, I'm using a list of ranges. > > > Anyhow, if anyone has any ideas about other representations, I'm all > > ears. > One representation that I saw used for land/sea masks, which, > like the character class bitmaps, tend to have groups of similar > bits near each other (ie the middle of the Pacific Ocean is (nearly) > all sea, was an array of bit maps. One of the problems with deciding on a representation is that I don't know, a priori, how the bits will be distributed. Still, two frequent cases probably predominate: a single bit (to match a single character), or a predefined set of bits (e.g. [:alpha:]). With regards to the latter, I'll have to check to see whether there really are large blocks. I know that in some cases, capitals and small letters alternate, which means that something like [:lower:] will have every other bit set. On the other hand, this should only affect alphabets which have upper and lower case, which corresponds to a small minority of all characters. Presumably, some clever processing on the Unicode data file should be able to calculate an optimal size for the predefined sets, and since most of the others will be just a single characters, this might be the best solution. I had considered using something like this, but with dynamically determined ranges. The problem was always how to optimally determine the ranges, especially after logical operations like union or intersection. (One potentially common case might be [_:alpha:], for example -- all of the alpha characters, plus '_'. Internally, when parsing this, :alpha: leads to a static const, which gets or'ed into the previous set consisting of the '_'.) Using fixed size ranges should make this somewhat easier. [...] > The optimum value for BitsPerMap will obviously depend on the > actual distribution of 1 and 0 bits, and will be different for > different character classes. Also, if it is possible to keep the > number of actual maps needed to 254 or less, the map indices can be > stored in an unsigned char, reducing memory usage even more. > Similarly, in some cases, some of the maps may happen to be equal > because of repeated patterns of bits, and it may be possible to > share the bit maps. It should also be possible to try various values > of BitsPerMap and find the optimum. For starters, one might just try using 128 or 256. I think that many of the Unicode blocks are aligned at these values. At any rate, some experimentation is necessary (and I don't currently have time to do it). -- James Kanze mailto:kanze@gabi-soft.de Conseils en informatique orient�e objet/ Beratung in objektorientierter Datenverarbeitung Ziegelh�ttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)69 63198627 --- [ comp.std.c++ is moderated. To submit articles, try just posting with ] [ your news-reader. If that fails, use mailto:std-c++@ncar.ucar.edu ] [ --- Please see the FAQ before posting. --- ] [ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]