Anne-Edgar WILKE 6df43f8b17 Fix COMPOUNDHYPHENMIN=1 compound hyphenation
    FIRST BUG
    ---------

  Problem

In a compound word, the word parts of two characters are never
hyphenated.

  Example

To reproduce the bug, go to the directory hyphen-2.8.8 and run the
following:

echo "\
UTF-8
LEFTHYPHENMIN 1
RIGHTHYPHENMIN 1
COMPOUNDLEFTHYPHENMIN 1
COMPOUNDRIGHTHYPHENMIN 1
.post1
NEXTLEVEL
e1
a1
" > hyphen.pat

./example hyphen.pat <(echo postea)

The output is post=ea, but it should be post=e=a.

If you replace postea with posteaque in the command above, you get
post=e=a=que, which is correct. Indeed, the component "eaque" is now
five characters long, so it is hyphenated.

If you replace postea with ea, you get e=a, which is also correct,
because ea is not a compound word.

  Solution

In the file hyphen.c, line 966, "if (i - begin > 1)" must be replaced
with "if (i - begin > 0)".
Indeed, the word part spans the characters from begin to i inclusive, so
its length is i - begin + 1. To hyphenate word parts of length 2 and
above, the check must be i - begin + 1 >= 2, i.e. i - begin > 0.
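
The off-by-one can be checked in isolation. The sketch below is not the code
of hyphen.c: part_length, old_check and new_check are hypothetical names that
only mirror the two conditions quoted above, assuming begin and i are the
inclusive indices of the first and last character of a word part.

```c
#include <stdbool.h>

/* Length of a word part whose first and last characters sit at the
 * inclusive indices begin and i (hypothetical helper, not in hyphen.c). */
int part_length(int begin, int i) {
    return i - begin + 1;
}

/* Condition before the fix: true only for parts of length 3 and above. */
bool old_check(int begin, int i) {
    return i - begin > 1;
}

/* Condition after the fix: true for every part of length 2 and above. */
bool new_check(int begin, int i) {
    return i - begin > 0;
}
```

For the part "ea" of "postea" (begin = 4, i = 5) the length is 2, so the old
condition rejects it and the new one accepts it. For "eaque" (begin = 4,
i = 8) both conditions hold.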

    SECOND BUG
    ----------

  Problem

In a compound word, the word parts are never hyphenated between their
second to last and their last character.

  Example

To reproduce the bug, run the following:

echo "\
UTF-8
LEFTHYPHENMIN 1
RIGHTHYPHENMIN 1
COMPOUNDLEFTHYPHENMIN 1
COMPOUNDRIGHTHYPHENMIN 1
1que.
NEXTLEVEL
e1
" > hyphen.pat

./example hyphen.pat <(echo meaque)

The output is mea=que, but it should be me=a=que.

Again, if you replace meaque with mea, you get me=a, which is correct,
because mea is not a compound word.

If you replace meaque with eamque, you get e=am=que, as expected; this
shows that there is no similar bug between the first and the second
character of word parts.

  Solution

In the file hyphen.c, line 983, "for (j = 0; j < i - begin - 1; j++)"
must be replaced with "for (j = 0; j < i - begin; j++)".
Indeed, the word part has length i - begin + 1, so there are i - begin
possible places for a hyphen. Thus j must take i - begin different
values, i.e. go from 0 to i - begin - 1.
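
The counting argument can be illustrated with a small sketch. It does not
reproduce the loop body of hyphen.c: the two functions are hypothetical and
only count how many values j takes under each loop bound, assuming begin and
i are the inclusive indices of the word part.

```c
/* Iterations of the loop before the fix: the last gap is never visited. */
int positions_visited_old(int begin, int i) {
    int count = 0;
    for (int j = 0; j < i - begin - 1; j++)
        count++;
    return count;
}

/* Iterations of the loop after the fix: one per gap between neighbouring
 * characters, i.e. i - begin gaps for a part of length i - begin + 1. */
int positions_visited_fixed(int begin, int i) {
    int count = 0;
    for (int j = 0; j < i - begin; j++)
        count++;
    return count;
}
```

For the part "mea" of "meaque" (begin = 0, i = 2) there are two gaps, m|e and
e|a. The old bound visits only one of them, which matches the missing hyphen
in mea=que.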
2015-08-28 00:53:59 +02:00
doc fix coverity warnings 2012-06-29 10:02:24 +00:00
tests fdo#43931 (hard hyphen hyphenation) + fdo#54843 (rhmin fix) 2012-09-13 07:50:50 +00:00
.cvsignore add .cvsignores 2010-03-04 12:19:52 +00:00
AUTHORS sync 2.8.3 into CVS 2012-06-29 07:10:58 +00:00
COPYING Initia import 2010-03-04 12:13:53 +00:00
COPYING.LGPL Initia import 2010-03-04 12:13:53 +00:00
COPYING.MPL Initia import 2010-03-04 12:13:53 +00:00
ChangeLog coverity#58283 patterns vs MAXPATHS 2014-09-18 15:42:34 +00:00
INSTALL Initia import 2010-03-04 12:13:53 +00:00
Makefile.am hjn_hyphen_load_file patch for sandboxing by Pawel Hajdan 2013-03-18 10:49:03 +00:00
Makefile.in hjn_hyphen_load_file patch for sandboxing by Pawel Hajdan 2013-03-18 10:49:03 +00:00
NEWS bump for hyphen 2.8.8 2014-09-18 15:47:14 +00:00
README sync 2.8.3 into CVS 2012-06-29 07:10:58 +00:00
README.compound sync 2.8.3 into CVS 2012-06-29 07:10:58 +00:00
README.hyphen Initia import 2010-03-04 12:13:53 +00:00
README.nonstandard Initia import 2010-03-04 12:13:53 +00:00
README_hyph_en_US.txt Initia import 2010-03-04 12:13:53 +00:00
THANKS Initia import 2010-03-04 12:13:53 +00:00
TODO Initia import 2010-03-04 12:13:53 +00:00
aclocal.m4 fix coverity warnings 2012-06-29 10:02:24 +00:00
checkme.lst Initia import 2010-03-04 12:13:53 +00:00
config.guess Resolves: rhbz#925563 support aarch64 2014-06-27 08:37:30 +00:00
config.sub Resolves: rhbz#925563 support aarch64 2014-06-27 08:37:30 +00:00
configure bump for hyphen 2.8.8 2014-09-18 15:47:14 +00:00
configure.in bump for hyphen 2.8.8 2014-09-18 15:47:14 +00:00
depcomp Initia import 2010-03-04 12:13:53 +00:00
example.c fix coverity warnings 2012-06-29 10:02:24 +00:00
hnjalloc.c Initia import 2010-03-04 12:13:53 +00:00
hnjalloc.h Initia import 2010-03-04 12:13:53 +00:00
hyph_en_US.dic sync 2.8.3 into CVS 2012-06-29 07:10:58 +00:00
hyphen.c Fix COMPOUNDHYPHENMIN=1 compound hyphenation 2015-08-28 00:53:59 +02:00
hyphen.h add missing #include <stdio.h> to hyphen.h 2014-06-30 10:22:51 +00:00
hyphen.patch Initia import 2010-03-04 12:13:53 +00:00
hyphen.tex Initia import 2010-03-04 12:13:53 +00:00
install-sh Initia import 2010-03-04 12:13:53 +00:00
lig.awk Initia import 2010-03-04 12:13:53 +00:00
ligpatch.txt Initia import 2010-03-04 12:13:53 +00:00
ltmain.sh fix coverity warnings 2012-06-29 10:02:24 +00:00
missing Initia import 2010-03-04 12:13:53 +00:00
ooopatch.sed NOHYPHEN feature, see README.compound 2010-11-27 02:20:33 +00:00
substrings.c coverity#58283 patterns vs MAXPATHS 2014-09-18 15:42:34 +00:00
substrings.pl Initia import 2010-03-04 12:13:53 +00:00
tbhyphext.sh Initia import 2010-03-04 12:13:53 +00:00
tbhyphext.tex Initia import 2010-03-04 12:13:53 +00:00
test-driver bump for hyphen 2.8.8 2014-09-18 15:47:14 +00:00

README

Hyphen - hyphenation library to use converted TeX hyphenation patterns
 
(C) 1998 Raph Levien
(C) 2001 ALTLinux, Moscow
(C) 2006, 2007, 2008, 2010, 2011 László Németh
 
This was part of the libHnj library by Raph Levien.
 
Peter Novodvorsky from ALTLinux extracted the hyphenation part of libHnj
for use in OpenOffice.org.
 
Compound word and non-standard hyphenation support by László Németh.
  
License is the original LibHnj license:
LibHnj is dual licensed under LGPL and MPL (see also README.libhnj).

Because LGPL allows GPL relicensing, COPYING now contains an
LGPL/GPL/MPL tri-license for explicit Mozilla source compatibility.

The original Libhnj source with OOo's patches is maintained by Rene Engelhard
and Chris Halls at Debian:

http://packages.debian.org/stable/libdevel/libhnj-dev
and http://packages.debian.org/unstable/source/libhnj


OTHER FILES

This distribution is also the source of the en_US hyphenation patterns
"hyph_en_US.dic". See README_hyph_en_US.txt.

Source files of hyph_en_US.dic in the distribution:

hyphen.tex (en_US hyphenation patterns from plain TeX)

  Source: http://tug.ctan.org/text-archive/macros/plain/base/hyphen.tex

tbhyphext.tex: hyphenation exception log from the TUGboat archive

  Source of the hyphenation exception list: 
  http://www.ctan.org/tex-archive/info/digests/tugboat/tb0hyf.tex

  Generated with the hyphenex script
  (http://www.ctan.org/tex-archive/info/digests/tugboat/hyphenex.sh)

  sh hyphenex.sh <tb0hyf.tex >tbhyphext.tex


INSTALLATION

./configure
make
make install

UNIT TESTS (WITH VALGRIND DEBUGGER)

make check
VALGRIND=memcheck make check

USAGE

./example hyph_en_US.dic mywords.txt

or (under Linux)

echo example | ./example hyph_en_US.dic /dev/stdin

NOTE: For Unicode-encoded input, convert your words to lowercase
before hyphenation (in a UTF-8 console environment):

cat mywords.txt | awk '{print tolower($0)}' >mywordslow.txt

DEVELOPMENT

See README.hyphen for the hyphenation algorithm, README.nonstandard
and doc/tb87nemeth.pdf for non-standard hyphenation,
README.compound for compound word hyphenation, and tests/*.

Description of the dictionary format:

The first line contains the character encoding (ISO8859-x, UTF-8).

Possible options in the following lines:

LEFTHYPHENMIN num          minimal hyphenation distance from the left word end
RIGHTHYPHENMIN num         minimal hyphenation distance from the right word end
COMPOUNDLEFTHYPHENMIN num  min. hyph. dist. from the left compound word boundary
COMPOUNDRIGHTHYPHENMIN num min. hyph. dist. from the right comp. word boundary

hyphenation patterns       see README.* files

NEXTLEVEL                  separate the two compound sets (see README.compound)

Default values:
Without explicit declarations, the hyphenmin fields of the dict struct
are zero, but in this case lefthyphenmin and righthyphenmin
default to 2 during hyphenation (for backward compatibility).
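
The fallback described above can be sketched as follows. This is an
illustration under stated assumptions, not the library's code:
effective_hyphenmin is a hypothetical helper, and the only behaviour taken
from this README is that an undeclared (zero) lefthyphenmin or righthyphenmin
is treated as 2 at hyphenation time.

```c
/* Hypothetical helper: map a stored hyphenmin field to the value used
 * during hyphenation. A field left at zero (no explicit declaration in
 * the dictionary) falls back to the backward-compatible default of 2. */
int effective_hyphenmin(int declared) {
    return declared > 0 ? declared : 2;
}
```

With this rule a dictionary that declares LEFTHYPHENMIN 1 keeps the value 1,
while a dictionary with no declaration behaves as if it had declared 2.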

Comments

Use a percent sign at the beginning of a line to add comments to your
hyphenation patterns (after the character encoding in the first line):

% comment

*****************************************************************************
* Warning! Correct working of Libhnj *needs* prepared hyphenation patterns. *

For example, generating hyph_en_US.dic from "hyphen.us" TeX patterns:
    
perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1

or with default LEFTHYPHENMIN and RIGHTHYPHENMIN values:

perl substrings.pl hyphen.us hyph_en_US.dic ISO8859-1 2 3
perl substrings.pl hyphen.gb hyph_en_GB.dic ISO8859-1 3 3
****************************************************************************

OTHERS

Java hyphenation: Peter B. West (Folio project) implements a hyphenator with
non-standard hyphenation facilities based on extended Libhnj. The HyFo module
is released in binary form as jar files and in source form as zip files.
See http://sourceforge.net/project/showfiles.php?group_id=119136

László Németh
<nemeth (at) numbertext (dot) org>