Few cents about my commits

fix 509 + tutorial: how to generate icudt51l from source

Issue 509 reports that the Collator API fails. A short investigation shows that the root cause is missing collator/normalizer files in icu.dat. A simple solution for users would be to generate their own data file and ship it as part of the application. However, the ICU Data Customizer is no longer available online, and the Python scripts for customization are not available for the old v51.
icu.dat was therefore built manually and delivered as PR510. The steps are described below.

Tutorial. Building icu.dat from sources

Download sources and data files

The v51 sources are available on the GitHub release page. Pick both the source and data packages:

  • icu4c-51_3-src.tgz
  • icu4c-51_3-data.zip

Unpack the source and data packages. As a result there should be two folders:

  • icu
  • data
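The unpack step can be sketched as two small shell helpers. The archive names are the ones listed above; the function names `unpack_icu51` and `check_icu51_layout` are made up for illustration, and the sketch assumes that `tar` and `unzip` are installed and that the archives unpack to `icu` and `data` as described.

```shell
# Hypothetical helpers (names are illustrative, not part of ICU).

# Unpack both v51 archives found in directory $1 (default: current directory).
unpack_icu51() {
    dir=${1:-.}
    (
        cd "$dir" || exit 1
        tar xzf icu4c-51_3-src.tgz || exit 1   # expected to create ./icu
        unzip -q -o icu4c-51_3-data.zip        # expected to create ./data
    )
}

# Succeed only if directory $1 contains the layout described above.
check_icu51_layout() {
    dir=${1:-.}
    [ -d "$dir/icu/source" ] && [ -d "$dir/data" ]
}
```

Typical use would be `unpack_icu51 . && check_icu51_layout . && echo "layout ok"`.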

Replace data folder

The sources come with a pre-built icudt51l.dat in icu/source/data/in, and the build script will use it. To build a customized .dat file, the data folder must be replaced with the source files from the data package:

rm -rf icu/source/data 
cp -r data/ icu/source/data

Before customization: backup

The entities to be included in icudt51l.dat are listed in .mk files. Before touching them, let's make backups:

cd icu/source/data 

cp coll/colfiles.mk coll/colfiles.mk.bak
cp brkitr/brkfiles.mk brkitr/brkfiles.mk.bak
cp curr/resfiles.mk curr/resfiles.mk.bak
cp lang/resfiles.mk lang/resfiles.mk.bak
cp locales/resfiles.mk locales/resfiles.mk.bak
cp rbnf/rbnffiles.mk rbnf/rbnffiles.mk.bak  
cp region/resfiles.mk region/resfiles.mk.bak
cp zone/resfiles.mk zone/resfiles.mk.bak
cp mappings/ucmcore.mk mappings/ucmcore.mk.bak
cp mappings/ucmebcdic.mk mappings/ucmebcdic.mk.bak
cp mappings/ucmfiles.mk mappings/ucmfiles.mk.bak
cp sprep/sprepfiles.mk sprep/sprepfiles.mk.bak
cp misc/miscfiles.mk misc/miscfiles.mk.bak
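The same backup can be sketched as a loop, which also picks up any .mk file not listed explicitly above. The function name `backup_mk_files` is hypothetical. It assumes the .mk files sit exactly one directory level below the data folder, as in the commands above, and it deliberately skips files that already have a backup so that a second run after customization does not overwrite the originals.

```shell
# Hypothetical helper: back up every .mk file one directory level below $1
# (e.g. icu/source/data) as <name>.mk.bak, keeping any existing backup.
backup_mk_files() {
    data_dir=${1:-.}
    for mk in "$data_dir"/*/*.mk; do
        [ -f "$mk" ] || continue                 # glob matched nothing
        [ -e "$mk.bak" ] || cp "$mk" "$mk.bak"   # never overwrite a backup
    done
}
```

From the top-level folder this would be called as `backup_mk_files icu/source/data`.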

Customizing data

To fix issue 509, collator and normalizer files need to be added. First of all, though, the target icudt51l.dat should be kept as small as possible to reduce the footprint, so I will try to keep its existing structure as described in the previous post.
The ICU4C Footprint page recommends simply deleting the .mk files that are not required. However, to keep the structure similar to the existing one and reduce the risk of new issues, I will keep the root and en entries. The other entries are removed as instructed.
Removing entries:

cd icu/source/data 

rm brkitr/brkfiles.mk 
rm mappings/ucmfiles.mk 
rm mappings/ucmcore.mk 
rm mappings/ucmebcdic.mk 
rm translit/trnsfiles.mk 

Keeping root and en files. Replace the content of the following files:

cd icu/source/data 

echo "GENRB_SOURCE = en.txt" > locales/resfiles.mk 
echo "MISC_SOURCE = supplementalData.txt likelySubtags.txt icuver.txt icustd.txt numberingSystems.txt" > misc/miscfiles.mk 
echo "RBNF_SOURCE = en.txt" > rbnf/rbnffiles.mk 
echo "CURR_SOURCE = en.txt" > curr/resfiles.mk 
echo "ZONE_SOURCE = en.txt" > zone/resfiles.mk 
echo "LANG_SOURCE = en.txt" > lang/resfiles.mk 
echo "REGION_SOURCE = en.txt" > region/resfiles.mk 
echo "SPREP_SOURCE = rfc3491.txt" > sprep/sprepfiles.mk 

# adding collator
echo "COLLATION_SOURCE = en.txt" > coll/colfiles.mk 

NB: To add or keep other languages, just add the language file to the list. Check the backups of the .mk files for reference.
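As a sketch of how more languages could be added, the helper below writes a .mk file from a list of locale names. The function name `write_mk_list` is hypothetical, and `de` is only an example locale; the corresponding .txt file is assumed to exist in the matching data subfolder (the .mk backups show which ones are available).

```shell
# Hypothetical helper: write "<VAR> = a.txt b.txt ..." to a .mk file.
# Usage: write_mk_list VAR FILE locale...
write_mk_list() {
    var=$1
    file=$2
    shift 2
    list=""
    for loc in "$@"; do
        list="$list $loc.txt"    # each locale name maps to <locale>.txt
    done
    echo "$var =$list" > "$file"
}
```

For example, `write_mk_list COLLATION_SOURCE coll/colfiles.mk en de` produces a file containing `COLLATION_SOURCE = en.txt de.txt`.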

Removing extra files

As mentioned on the ICU4C Footprint page, besides the .mk files the makefile should be tweaked to remove even more files. To keep the data file the same as it was before, the following files must not be packed:

  • confusables.cfu
  • ibm-1047_P100-1995.cnv
  • ibm-37_P100-1995.cnv
  • unames.icu

The following lines in icu/source/data/Makefile.in need to be modified as shown.

line 243:

DAT_FILES_SHORT=cnvalias.icu coll/ucadata.icu coll/invuca.icu nfc.nrm nfkc.nrm nfkc_cf.nrm uts46.nrm

line 265:

ALL_CFU_SOURCE=
CFU_FILES_SHORT=
CFU_FILES=

line 274:

ALL_UCM_SOURCE=$(UCM_SOURCE_CORE) $(UCM_SOURCE_FILES) $(UCM_SOURCE_EBCDIC) $(UCM_SOURCE_LOCAL)

The Collation Customization page also explains how to reduce collation data:

Collation rule strings in general are not commonly used but are a significant portion of the data size in ICU collation resource bundles, especially for CJK languages. The rule strings can be omitted from those resource bundles by adding the --omitCollationRules option to the relevant genrb invocations (e.g., in ICU’s source/data/Makefile.in).

line 656:

$(INVOKE) $(TOOLBINDIR)/genrb $(GENRBOPTS) --omitCollationRules -i $(BUILDDIR) -s $(COLSRCDIR) -d $(COLBLDDIR) $(<F)

line 658:

$(INVOKE) $(TOOLBINDIR)/genrb $(GENRBOPTS) --omitCollationRules -i $(BUILDDIR) -s $(OUTTMPDIR)/$(COLLATION_TREE) -d $(COLBLDDIR) $(INDEX_NAME).txt
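Editing by line number is fragile if the makefile differs even slightly, so the same change can be sketched with sed. This is only a sketch: the function name `add_omit_collation_rules` is hypothetical, and it assumes that the unmodified lines look like the two lines above without the option, i.e. that they contain `genrb $(GENRBOPTS) ` and write to `-d $(COLBLDDIR)`.

```shell
# Hypothetical helper: print Makefile.in ($1) with --omitCollationRules added to
# genrb invocations that write into $(COLBLDDIR). Lines that already carry the
# option are left alone. Output goes to stdout; the input file is not modified.
add_omit_collation_rules() {
    sed '/-d \$(COLBLDDIR)/{
/--omitCollationRules/!s/genrb \$(GENRBOPTS) /&--omitCollationRules /
}' "$1"
}
```

One way to apply it would be `add_omit_collation_rules Makefile.in > Makefile.in.new && mv Makefile.in.new Makefile.in`, followed by a diff against the original to confirm that only the two collation lines changed.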

Building

Fix line 29 in icu/source/data/pkgdataMakefile.in to look as below:

@echo LD_SONAME="$(LD_SONAME)" >> $(OUTPUTFILE)

Configure and build:

cd icu/source
./runConfigureICU --enable-debug MacOSX
make

This builds all ICU binary tools and compiles the data into icu/source/data/out/tmp/icudt51l.dat. This file can be used as a custom data file by adding it to the native iOS resource folder of the project, or it can replace the embedded one in RoboVM (check the post for instructions).

After this point, make can be invoked in the icu/source/data folder to build the data file only.
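A quick sanity check after each rebuild is to confirm that the data file exists and to look at its size. The helper below is a minimal sketch; the name `dat_size` is hypothetical.

```shell
# Hypothetical helper: print the size in bytes of a built data file ($1),
# or fail with a message if the file is missing or empty.
dat_size() {
    [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
    wc -c < "$1" | tr -d ' '    # strip the padding some wc versions add
}
```

Run from the data folder, `make && dat_size out/tmp/icudt51l.dat` rebuilds the data and prints the resulting size, which makes it easy to see how much each added or removed entry costs.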

Resulting list of files in icudt51l.dat

The total size of icudt51l.dat is 1089040 bytes.

file size (bytes)
cnvalias.icu 64120
icustd.res 84
icuver.res 120
likelySubtags.res 23592
nfc.nrm 33216
nfkc.nrm 52224
nfkc_cf.nrm 48944
numberingSystems.res 3324
pool.res 80
res_index.res 112
rfc3491.spp 20522
supplementalData.res 79656
uts46.nrm 55504
en.res 10564
brkitr/res_index.res 92
coll/en.res 45984
coll/pool.res 80
coll/res_index.res 100
coll/root.res 992
coll/supplementalData.res 25852
lang/en.res 26676
lang/pool.res 80
lang/res_index.res 100
lang/root.res 296
rbnf/en.res 8916
rbnf/res_index.res 100
rbnf/root.res 13620
region/en.res 8572
region/pool.res 80
region/res_index.res 100
region/root.res 104
zone/en.res 18592
zone/pool.res 80
zone/res_index.res 100
zone/root.res 2176
