fix 509 + tutorial: how to generate icudt51l from source
17 Aug 2020 | icu.dat fix tutorial
Issue 509 says that the Collator API fails. A short investigation shows that the root cause is missing collator/normalizer files in icu.dat. A simple solution for the user is to generate a custom data file and ship it as part of the application. But the ICU data customizer is no longer available online, and the Python scripts for customization are not available for the old v51. So icu.dat was built manually and delivered as PR510. The steps are described below.
Tutorial. Building icu.dat from sources
Download sources and data files
The v51 sources are available on the GitHub release page. Pick both the source and data packages:
- icu4c-51_3-src.tgz
- icu4c-51_3-data.zip
Unpack the source and data packages. As a result there should be two folders:
icu
data
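The unpacking step could be sketched as below. It assumes both archives were downloaded into the current directory and that tar and unzip are installed; check_unpacked is a hypothetical helper name, not part of ICU.

```shell
# Unpack the archives if they are in the current directory (assumed location).
if [ -f icu4c-51_3-src.tgz ]; then tar -xzf icu4c-51_3-src.tgz; fi
if [ -f icu4c-51_3-data.zip ]; then unzip -q icu4c-51_3-data.zip; fi

# check_unpacked reports whether the two expected folders exist.
# Returns non-zero at the first missing folder.
check_unpacked() {
    for dir in icu data; do
        if [ -d "$dir" ]; then
            echo "ok: $dir"
        else
            echo "missing: $dir"
            return 1
        fi
    done
}
```

Running check_unpacked right after unpacking gives an early signal if one of the archives was skipped or extracted elsewhere.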
Replace data folder
The sources come with a pre-built icudt51l.dat in icu/source/data/in, and the build script will use it. To build a customized .dat, the data folder must instead contain the source files from the data package. Replace it:
rm -rf icu/source/data
cp -r data/ icu/source/data
Before customization: backup
The entities to be included in icudt51l.dat are listed in .mk files. Before touching these, let's back them up:
cd icu/source/data
cp coll/colfiles.mk coll/colfiles.mk.bak
cp brkitr/brkfiles.mk brkitr/brkfiles.mk.bak
cp curr/resfiles.mk curr/resfiles.mk.bak
cp lang/resfiles.mk lang/resfiles.mk.bak
cp locales/resfiles.mk locales/resfiles.mk.bak
cp rbnf/rbnffiles.mk rbnf/rbnffiles.mk.bak
cp region/resfiles.mk region/resfiles.mk.bak
cp zone/resfiles.mk zone/resfiles.mk.bak
cp mappings/ucmcore.mk mappings/ucmcore.mk.bak
cp mappings/ucmebcdic.mk mappings/ucmebcdic.mk.bak
cp mappings/ucmfiles.mk mappings/ucmfiles.mk.bak
cp sprep/sprepfiles.mk sprep/sprepfiles.mk.bak
cp misc/miscfiles.mk misc/miscfiles.mk.bak
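As an alternative to listing every file, the backup can be written as a loop. This is a minimal sketch: backup_mk_files is a hypothetical helper, and it assumes every .mk list sits exactly one directory below the data folder. It also picks up translit/trnsfiles.mk, which is removed in a later step but is not in the list above.

```shell
# backup_mk_files [DIR] copies DIR/*/*.mk to DIR/*/*.mk.bak.
# DIR defaults to icu/source/data. Existing backups are never overwritten,
# so the first (pristine) copy survives repeated runs.
backup_mk_files() {
    data_dir=${1:-icu/source/data}
    for mk in "$data_dir"/*/*.mk; do
        [ -f "$mk" ] || continue        # glob matched nothing
        [ -f "$mk.bak" ] && continue    # keep the existing backup
        cp "$mk" "$mk.bak"
    done
    return 0
}
```

Skipping existing backups matters when the customization is repeated: a second run would otherwise replace the original lists with the already trimmed ones.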
Customizing data
To fix issue 509, collator and normalizer files need to be added. But first of all, the target icudt51l.dat should be kept as small as possible to reduce the footprint. I will try to keep its existing structure, as described in the previous post.
The ICU4C Footprint page recommends simply deleting the .mk files that are not required. However, to keep the structure similar to the existing one and reduce the risk of new issues, I will keep the root and en entries. Other entries will be removed as instructed:
Removing entries:
cd icu/source/data
rm brkitr/brkfiles.mk
rm mappings/ucmfiles.mk
rm mappings/ucmcore.mk
rm mappings/ucmebcdic.mk
rm translit/trnsfiles.mk
Keeping the root and en files. Replace the content of the following files:
cd icu/source/data
echo "GENRB_SOURCE = en.txt" > locales/resfiles.mk
echo "MISC_SOURCE = supplementalData.txt likelySubtags.txt icuver.txt icustd.txt numberingSystems.txt" > misc/miscfiles.mk
echo "RBNF_SOURCE = en.txt" > rbnf/rbnffiles.mk
echo "CURR_SOURCE = en.txt" > curr/resfiles.mk
echo "ZONE_SOURCE = en.txt" > zone/resfiles.mk
echo "LANG_SOURCE = en.txt" > lang/resfiles.mk
echo "REGION_SOURCE = en.txt" > region/resfiles.mk
echo "SPREP_SOURCE = rfc3491.txt" > sprep/sprepfiles.mk
# adding collator
echo "COLLATION_SOURCE = en.txt" > coll/colfiles.mk
NB: To add or keep other languages, just add their language files to the list. Check the backups of the .mk files for reference.
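Adding a language can be sketched with a small helper that writes the list in the same single-line form as the echo commands above, and refuses to write it if a listed source file is missing. write_mk_list is a hypothetical name, and the choice of de.txt in the example is only an illustration.

```shell
# write_mk_list FILE VARIABLE ENTRY...
# Writes "VARIABLE = ENTRY ENTRY ..." to FILE, but only if every ENTRY
# exists next to FILE. On a missing entry FILE is left untouched.
write_mk_list() {
    file=$1
    variable=$2
    shift 2
    dir=$(dirname "$file")
    for entry in "$@"; do
        if [ ! -f "$dir/$entry" ]; then
            echo "missing source file: $dir/$entry" >&2
            return 1
        fi
    done
    echo "$variable = $*" > "$file"
}

# Example (run from icu/source/data; de.txt is an illustrative choice):
# write_mk_list coll/colfiles.mk COLLATION_SOURCE en.txt de.txt
```

The existence check is there because a typo in a list would otherwise only show up later, during the build.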
Removing extra files
As mentioned on the ICU4C Footprint page, besides the .mk files the makefile should be tweaked to remove even more files. To keep the data file the same as it was before, the following files must not be packed:
- confusables.cfu
- ibm-1047_P100-1995.cnv
- ibm-37_P100-1995.cnv
- unames.icu
The following lines in icu/source/data/Makefile.in are to be changed into:
line 243:
DAT_FILES_SHORT=cnvalias.icu coll/ucadata.icu coll/invuca.icu nfc.nrm nfkc.nrm nfkc_cf.nrm uts46.nrm
line 265:
ALL_CFU_SOURCE=
CFU_FILES_SHORT=
CFU_FILES=
line 274:
ALL_UCM_SOURCE=$(UCM_SOURCE_CORE) $(UCM_SOURCE_FILES) $(UCM_SOURCE_EBCDIC) $(UCM_SOURCE_LOCAL)
Also Collation Customization page says how to reduce collation data:
Collation rule strings in general are not commonly used but are a significant portion of the data size in ICU collation resource bundles, especially for CJK languages. The rule strings can be omitted from those resource bundles by adding the --omitCollationRules option to the relevant genrb invocations (e.g., in ICU’s source/data/Makefile.in).
line 656:
$(INVOKE) $(TOOLBINDIR)/genrb $(GENRBOPTS) --omitCollationRules -i $(BUILDDIR) -s $(COLSRCDIR) -d $(COLBLDDIR) $(<F)
line 658:
$(INVOKE) $(TOOLBINDIR)/genrb $(GENRBOPTS) --omitCollationRules -i $(BUILDDIR) -s $(OUTTMPDIR)/$(COLLATION_TREE) -d $(COLBLDDIR) $(INDEX_NAME).txt
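Since the edits above are located by line number, it can help to confirm afterwards that both genrb invocations really carry the new option. A minimal sketch, where count_omit_rules is a hypothetical helper built on plain grep:

```shell
# count_omit_rules FILE prints how many lines of FILE contain a genrb
# invocation with --omitCollationRules.
# Note: grep -c prints 0 and exits non-zero when nothing matches.
count_omit_rules() {
    grep -c -e 'genrb .*--omitCollationRules' "$1"
}

# Example (run from icu/source/data); 2 is the expected count after the edits above:
# count_omit_rules Makefile.in
```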
Building
Fix line 29 in icu/source/data/pkgdataMakefile.in to look as below:
@echo LD_SONAME="$(LD_SONAME)" >> $(OUTPUTFILE)
Configure and build:
cd icu/source
./runConfigureICU --enable-debug MacOSX
make
This builds all the ICU binary tools and compiles the data into icu/source/data/out/tmp/icudt51l.dat. This file can be used as a custom data file by adding it to the native iOS resource folder of the project, or it can replace the one embedded in RoboVM (check the post for instructions).
After this point, make can be invoked in the icu/source/data folder to rebuild the data file only.
Result list of files in icudt51l.dat
The total size of icudt51l.dat is 1089040 bytes.
file | size (bytes) |
---|---|
cnvalias.icu | 64120 |
icustd.res | 84 |
icuver.res | 120 |
likelySubtags.res | 23592 |
nfc.nrm | 33216 |
nfkc.nrm | 52224 |
nfkc_cf.nrm | 48944 |
numberingSystems.res | 3324 |
pool.res | 80 |
res_index.res | 112 |
rfc3491.spp | 20522 |
supplementalData.res | 79656 |
uts46.nrm | 55504 |
en.res | 10564 |
brkitr/res_index.res | 92 |
coll/en.res | 45984 |
coll/pool.res | 80 |
coll/res_index.res | 100 |
coll/root.res | 992 |
coll/supplementalData.res | 25852 |
lang/en.res | 26676 |
lang/pool.res | 80 |
lang/res_index.res | 100 |
lang/root.res | 296 |
rbnf/en.res | 8916 |
rbnf/res_index.res | 100 |
rbnf/root.res | 13620 |
region/en.res | 8572 |
region/pool.res | 80 |
region/res_index.res | 100 |
region/root.res | 104 |
zone/en.res | 18592 |
zone/pool.res | 80 |
zone/res_index.res | 100 |
zone/root.res | 2176 |
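To compare a fresh build against the figures above, something like the following sketch can be used. dat_size and sum_sizes are hypothetical helpers. The sum of the listed item sizes is expected to be smaller than the file size, because the package also stores a header and a table of contents.

```shell
# dat_size FILE prints the size of FILE in bytes.
dat_size() {
    wc -c < "$1" | tr -d ' '    # BSD wc pads the number with spaces
}

# sum_sizes reads "name | size |" rows on stdin and prints the sum of the
# size column. Header and separator rows are ignored.
sum_sizes() {
    awk -F'|' '$2 ~ /^ *[0-9]+ *$/ { total += $2 } END { print total + 0 }'
}

# Example (path as produced by the build step):
# dat_size icu/source/data/out/tmp/icudt51l.dat
```

The icupkg tool built together with ICU can also list the items of a package (icupkg -l icudt51l.dat), assuming the tool and its libraries are reachable from the shell.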