grep for WIN32


Preamble

For ages and AGES I have had my own Find-All-text-in-file-set utility, FA4, but have always been fascinated by the unix grep application, whose name I understand originally stood for 'Global Regular Expression Print', and which seems ALWAYS to be available in unix and linux distributions ...

Some years back I came across a WIN32 port of some of the GNU tools, including 'grep', and thought I had it made ;=)) but unfortunately this port never ran exactly as I had expected, and shows little information when it fails to find your text ... so that pushed me to continue to enhance my own FA4.exe, adding the pcre (Perl Compatible Regular Expression) library, so it too could do 'regular expressions' ...

More recently, May 2008, I came across the CVS source of 'grep'. It appears to be version 2.5.4-cvs (circa Nov, 2007), so I checked it out, and set about compiling it with MSVC. I had already successfully 'made' it in Ubuntu linux ... There were no MSVC build files available, but from the Makefile.am files, it consisted of the static library, greputils, and the grep.c source, so it was not difficult to make a project in MSVC ...

I decided to keep the MSVC build out of the current directory system of CVS grep, and thus created a new folder, 'build', for this. I have a rough Perl script that reads Makefile.am files, and generates a general MSVC6 DSW/DSP file set. I used this as the start, and did a build using MSVC6 (circa 1998).

Thus quite quickly I had a set of build files, in which one of the compiler defines is /D "HAVE_CONFIG_H=1", since I had noted all the sources start with the code -

#if HAVE_CONFIG_H
# include <config.h>
#endif

Rather than putting this 'config.h' in the root folder, which is where it is normally generated, if there is no general 'include' folder, or in the 'src' folder, I put it only in my special 'build' folder. This keeps it away from the eyes of the *nix programmer, since he/she will want to delete it, and let it be auto-generated, and I want it as part of the distribution, solely for WIN32 builds!

It is handy that each file includes this, since a lot of the kludges necessary to compile open source in native windows can be 'hidden' in here ... Further, to provide some WIN32 'glue' code I also created a winport.c, and winport.h, also in this build folder. This 'winport' file contains my implementation of opendir()/readdir()/closedir(), as well as a few extra diagnostic displays and some 'debug' only services.
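To give a feel for the kind of kludge that can be 'hidden' in a hand-written config.h, here is a minimal sketch. It is NOT the actual file from the build folder: the guard name and the choice of remappings are assumptions made only for illustration, although _stricmp, _strnicmp and _snprintf are real MSVC runtime names.

```c
/* config.h - hand-written for WIN32 builds (illustrative sketch only) */
#ifndef GREPW32_CONFIG_H        /* hypothetical guard name */
#define GREPW32_CONFIG_H

#undef HAVE_UNISTD_H            /* MSVC ships no <unistd.h> */

#ifdef _MSC_VER
/* Map POSIX names onto their MSVC runtime equivalents. */
# define strcasecmp  _stricmp
# define strncasecmp _strnicmp
# if _MSC_VER < 1900
/* Before Visual Studio 2015 there was no C99 snprintf. Note that
   _snprintf does not always NUL-terminate, so this is only a rough fit. */
#  define snprintf   _snprintf
# endif
#endif

#endif /* GREPW32_CONFIG_H */
```

Because every source starts with the HAVE_CONFIG_H include, remappings like these reach the whole project without touching the original sources.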

When this was complete I used MSVC8 (2005), and MSVC9 (2008), to load and convert these MSVC6 dsw/dsp build files to SLN/VCPROJ build files, and completed the build in these two newer environments. Interestingly, two errors, both in lib/regex.c, came up in MSVC8 that were not present in MSVC6 - (1) there was a name clash of 'errcode', which I changed to 'errcod', and (2) it seems the later runtime libraries no longer contain a 'wctype' function, but I applied another work around that already existed in the file.

Naturally I back applied these MSVC8 changes into the MSVC6 build to make sure it too accepted the changes.



configuration - a manual config.h

As you may know, config.h is one of the important files often generated by the automake system in unix. It is anathema (without the religious undertones) to unix, linux, *nix programmers that in Windows (a) we do NOT have an automake system, except if you install a unix emulator like cygwin, and (b) we only ever need to create such a config.h ONCE in a lifetime, and we can 'know' what to put in it ...

It traditionally contains a whole bunch of defines, and perhaps some 'un-defines' like -

#define HAVE_WCTYPE_H 1
#define HAVE_WCHAR_H 1
#define HAVE_BTOWC 1
#undef HAVE_UNISTD_H

These defines, or un-defines, are auto-generated in unix after the automake system has determined if this or that system header file exists. This is because unix seems to have a more 'flexible' set of system headers, while we in WIN32 ALWAYS have the same set, as installed when we installed MSVC, and/or a subsequent Platform SDK (PSDK), thus can 'fix' all these defines once, and it is likely to stay correct for YEARS ...

And then in the code there will be switches to include or exclude a particular header, like -

#if HAVE_WCHAR_H
# include <wchar.h>
#endif
#if HAVE_UNISTD_H
# include <unistd.h>
#endif

In projects that have already been ported to multiple other environments, there will sometimes be an #else where the needed functionality of the 'missing' header is implemented ...
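The #else idea can be sketched as follows. This is only an illustration, not code from grep: HAVE_STRNLEN is used here as an example switch (forced to 0 so the fallback is what gets compiled), and port_strnlen is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Normally config.h would set this. It is forced to 0 here only so that
   the fallback branch is the one compiled in this sketch. */
#define HAVE_STRNLEN 0

#if HAVE_STRNLEN
# define port_strnlen strnlen
#else
/* Fallback: length of s, but never looking past maxlen bytes. */
static size_t port_strnlen(const char *s, size_t maxlen)
{
    const char *end = memchr(s, '\0', maxlen);
    return end ? (size_t)(end - s) : maxlen;
}
#endif
```

Callers then use port_strnlen("grep", 10) everywhere, and never need to know which branch was compiled.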

This is also one of the few open sources where there seems to have been a determined attempt to support multi-byte character sets. To enable full support for this you need to define _ALL_ of - HAVE_ISWCTYPE, HAVE_LOCALE_H, HAVE_MBRLEN, HAVE_MBRTOWC, HAVE_WCHAR_H, HAVE_WCRTOMB, HAVE_WCSCOLL, HAVE_WCTYPE, HAVE_WCTYPE_H, HAVE_STDLIB_H, and MB_CUR_MAX.

While I have defined some, perhaps most, of these, I have NOT YET fully turned on this multi-byte character set support. Of course, in Windows we can often support wide (Unicode) character sets by simply defining UNICODE, or _UNICODE, but this seldom works successfully in unix open sources, since many things are 'forcefully' declared as, say, a 'char *', rather than the optional 'TCHAR *' or 'PTSTR' that we can use in windows.

And of course, along with such compile time single byte or wide declarations, we have to use the generic 'lstrlen', 'wsprintf', etc, which then resolve to either the ANSI character version or the unicode, non-ANSI version, much like choosing between 'strlen' and 'wcslen', as appropriate.
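The compile time switch described above can be sketched with stand-in names, so that it also compiles outside Windows. XCHAR, XT, xstrlen, USE_WIDE_CHARS and pattern_length are all hypothetical; in a real Windows build the <tchar.h> names TCHAR, _T() and _tcslen play these roles.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for TCHAR, _T() and _tcslen from <tchar.h>. */
#ifdef USE_WIDE_CHARS
typedef wchar_t XCHAR;
# define XT(s)   L##s
# define xstrlen wcslen
#else
typedef char XCHAR;
# define XT(s)   s
# define xstrlen strlen
#endif

/* Count the characters in a string of whichever width was chosen
   at compile time. */
static size_t pattern_length(const XCHAR *pattern)
{
    return xstrlen(pattern);
}
```

A call such as pattern_length(XT("grep")) compiles either way, which is exactly what a source that hard codes 'char *' everywhere can not offer.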



greputils - static utility library - greputils.lib

This static library is essentially all the sources in the 'lib' folder. Care has to be taken, however, not to get 'duplicates', since some source files are actually included through others: strtol.c, say, is included through strtoul.c, strtoumax.c also includes strtoul.c, etc. This took some time and effort to get exactly right, if I have done that ;=))

The compile still returns some warnings ... but no errors ;=))



grep - the main application - grepw32.exe

So as to NOT confuse it with any other 'grep' I may have, I have called it grepw32 - as in 'grep for WIN32'. And since I read that the egrep and fgrep applications are deprecated, these were NOT built, since their specific functionality is now ALL included in grep - grepw32.

 The compile still returns some warnings ... but no errors ;=))



gentests - a utility application - gentests.exe

The source contains a 'test' folder, where a considerable number of test cases have been put. Some of these 'cases' are actually generated using another unix tool - awk. I tried very hard to 'convert' all the test shell scripts to batch files, but ran into some real problems. The main one being that if you do -

echo "{"|grepw32 -E -e "{"

Then grepw32 will find '"{"' on stdin, double quotes included, rather than just the '{' character, which is more than is required, while the -e argument on the command line will be just the single '{' character, with the double quotes correctly stripped. But these double quotes are required on an 'echo' like ...

echo "|"|grepw32 -E -e "|"

This, and other problems, led me to write this utility application, which takes some of the test case input files, and generates a suitable batch file to run the test. In general, this was quite successful, but there are still some errors that do NOT occur in the pure unix make and tests ...



Regression Testing - grep/tests $ make check-TESTS for WIN32

In WIN32, normally, to run all the 'regression tests', it is only necessary to change into the build/tests folder, and run 'aRunAllTests.bat'. Using the above tool, and through a batch file gt.bat, spencer1.tests, spencer2.tests and bre.tests are converted to spencer1.bat, spencer2.bat and bre.bat respectively during this testing.

The file build/test/README.tests gives a further explanation. As it explains, not all tests will pass in WIN32 ;=((

I am still trying to fully get my head around 'locale' in unix, and 'code pages' in windows ;=)) The fmbtest.sh uses this, as follows :-

# If cs_CZ.UTF-8 locale doesn't work, skip this test silently
 LC_ALL=cs_CZ.UTF-8 locale -k LC_CTYPE 2>/dev/null | ${GREP} -q charmap.*UTF-8 \
 || exit 77

When I run, in the grep 'tests' folder -

 $ make check-TESTS

All the tests show PASS, except fmbtest.sh which shows SKIP! That is, 77 is returned. When I run that command in my Ubuntu shell, I get an output that includes charmap="ANSI_X3.4-1968" ... Why???

After re-writing the Makefile check-TESTS: into a simple tests.sh, and removing the LC_ALL=C from the environment, and reducing the above line to just 'locale -k LC_CTYPE 2>/dev/null | ${GREP} -q charmap.*UTF-8 || exit 77', fmbtest.sh _WAS_ run, but then it also showed FAIL: fmbtest.sh, JUST LIKE IN WINDOWS ;=/ - In fact, some 17 of the 20 or so tests within fmbtest.sh, actually my fmbtest2.sh, FAIL in Ubuntu!

Why the tests use cs_CZ (Czech cs_CZ.ISO8859-2?) in the first place is a big mystery ;=)) The default LANG in my Ubuntu linux is en_US.UTF-8, and why this can not be used is a mystery to be solved at a later time.

In Windows, the problem with my fmbtest.bat file I constructed, closely equivalent to fmbtest.sh, is the passing of hi-bit-set characters on the command line. If the default code page 437 (check with 'chcp') is used, characters like 'ÄŒas', or '&Auml;&OElig;as' in unicode HTML, namely hex 'C4 8C 61 73', get 'translated' by the command interpreter, and arrive into grepw32 as hex '2B(+) E4 2B(+) C6 61(a) 73(s)'. Note the C4 became 2BE4, and 8C became 2BC6 ???

However, if I change the active code page to 1252 (Latin I), before running the test, then these same 4 characters are 'translated' into 'C3 84 C5 92 61(a) 73(s)', WHICH SUCCEEDS ;=)) So, in the meantime, I have added the 'chcp 1252' to my build/tests/fmbtest.bat file, and default back to 437, for test #3 and #8, and now it completes without error, but not all tests were converted ...
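The byte values quoted above are consistent with UTF-8 encoding: 'C4 8C' is the UTF-8 form of U+010C, and 'C3 84 C5 92' is the UTF-8 form of U+00C4 followed by U+0152, that is, of the two characters 'Ä' and 'Œ' themselves. A minimal encoder sketch to check such values by hand follows. utf8_encode is a hypothetical helper, not part of grep, and it is limited to code points up to U+FFFF with no rejection of surrogates.

```c
#include <stddef.h>

/* Encode one code point (U+0000..U+FFFF only in this sketch) as UTF-8.
   Returns the number of bytes written to out (1 to 3), or 0 when the
   code point is outside the range handled here. */
static size_t utf8_encode(unsigned long cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

Feeding it 0x010C gives the two bytes C4 8C, while 0x00C4 and 0x0152 give C3 84 and C5 92 respectively.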

Another point about such hi-bit characters, and using editors such as the normally brain dead windows notepad.exe: the above 'ÄŒas' can get displayed as 'Čas', or '&#268;as' in unicode HTML ... and other entities likewise 'translated-on-display' ... Notepad does this by adding a 'Byte Order Mark', a BOM, of hex 'EF BB BF', the UTF-8 UNICODE BOM, plus a CR & LF, to the head of the file. This does not change the values, but does affect how they are displayed in editors that 'recognize' a BOM ... so you need to take care to use an editor that does NOT add a BOM!
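A quick way to check whether an editor has slipped a BOM into a test file is to look at its first three bytes. A minimal sketch, where has_utf8_bom and skip_utf8_bom are hypothetical helper names and the caller is assumed to have already read the start of the file into a buffer:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if the buffer starts with the UTF-8 BOM, hex EF BB BF. */
static int has_utf8_bom(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    return len >= 3 && memcmp(buf, bom, 3) == 0;
}

/* Return a pointer just past the BOM, if there is one, and reduce
   *len to match. Otherwise the buffer and length are left untouched. */
static const unsigned char *skip_utf8_bom(const unsigned char *buf, size_t *len)
{
    if (has_utf8_bom(buf, *len)) {
        *len -= 3;
        return buf + 3;
    }
    return buf;
}
```

A test generator could use something like this to warn about, or strip, a BOM before the bytes are compared.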

 



Problems

Some problems noted during testing ...

1. Redirection to a file in the same folder with multiple file search

Using a command like -

> grepw32 -r "this" * > tempf.txt

The problem seems to be related to grepw32 trying to open the tempf.txt file, which the shell has already created in the folder being searched ... there was no problem with -

> grepw32 -r "this" * > ..\tempf.txt

I need to investigate this further, but suspect it is the 'interesting' open file code -

   while ((desc = open (file, O_RDONLY)) < 0 && errno == EINTR)
      continue;

If 'open' keeps returning EINTR, which I understand means the call was interrupted by a signal before it could complete, then this could get stuck in the loop FOREVER ... maybe in windows this possibly permanent loop should be eliminated ... but as stated, this needs further investigation.
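One way to take the 'FOREVER' out of such a loop is simply to bound the number of retries. The following is a minimal sketch of that idea, not a tested fix for grepw32: open_readonly_bounded, OPEN_MAX_RETRIES and the port_open macro are hypothetical names, and whether EINTR is really what 'open' returns in this case is still an open question.

```c
#include <errno.h>
#include <fcntl.h>
#ifdef _WIN32
# include <io.h>
# define port_open _open      /* MSVC runtime name for open() */
#else
# include <unistd.h>
# define port_open open
#endif

#define OPEN_MAX_RETRIES 16   /* arbitrary limit chosen for this sketch */

/* Open file read-only, retrying on EINTR, but only a limited number
   of times. Returns the descriptor, or -1 with errno set. */
static int open_readonly_bounded(const char *file)
{
    int desc = -1;
    int tries;

    for (tries = 0; tries < OPEN_MAX_RETRIES; tries++) {
        desc = port_open(file, O_RDONLY);
        if (desc >= 0 || errno != EINTR)
            break;            /* success, or a 'real' error */
    }
    return desc;
}
```

If the limit is reached, the caller sees -1 with errno still EINTR, and can report it like any other open failure.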

2. Outputs redirected to a file

This is more a personal GRIPE than a 'problem' ;=))

Outputs to stdout have only a LF (\n) as the line ending. This means if the results are redirected to a file then that file looks a mess in editors that do not 'understand' unix line endings, like simple notepad. Thankfully most other editors can handle it without problems, but at some stage 'grepw32' should correctly output the Windows line ending, CR + LF, that is '\r\n', to be a 'correct' Windows application ...
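The translation itself is simple enough to sketch. The MSVC runtime already does this for any stream left in 'text' mode, so the function below only illustrates the idea, it is not a proposed patch; lf_to_crlf is a hypothetical name.

```c
#include <stddef.h>

/* Copy in[0..n) to out, writing CR LF for each bare LF. An LF that is
   already preceded by CR is left alone. Returns the number of bytes the
   full result needs. Nothing is written beyond outcap, so a return
   value greater than outcap means the output was truncated. */
static size_t lf_to_crlf(const char *in, size_t n, char *out, size_t outcap)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        if (in[i] == '\n' && (i == 0 || in[i - 1] != '\r')) {
            if (o < outcap)
                out[o] = '\r';
            o++;
        }
        if (o < outcap)
            out[o] = in[i];
        o++;
    }
    return o;
}
```

For example, the 5 bytes 'a', LF, 'b', CR, LF come out as the 6 bytes 'a', CR, LF, 'b', CR, LF.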

3. process tried to write to a nonexistent pipe

As can be seen in the gentests section, some considerable effort has been made to rewrite the shell script 'regression' tests into batch file form, and this has been quite successful, but with some failures. However, aside from the tests that 'fail', there is an occasional stderr message -

The process tried to write to a nonexistent pipe.

I have not tracked down exactly where this message comes from, or even if it is from grepw32, and can not seem to re-direct it, thus it persists. For the moment I am ignoring it ...



Downloads

Some downloads ...  The grepw32 'e' zip contains just a WIN32 executable, and the '-' contains the FULL source, including my special 'build' folder with files NOT in the CVS source, and a grep.diffnn.txt file to patch the CVS source. The patch file has also been included separately.

RUN EXECUTABLES AT YOUR OWN RISK!

Date        Zip              Size     Md5
2008/05/30  grepw32e02.zip    72,626  67f593fe2ebd9e765e9e1afba6659d12
2008/05/30  grepw32-02.zip   952,431  8e8724fce969b368da2e5a7200a82611
2008/05/30  grep.diff02.txt   10,578  patch text file only


Geoff
30 May, 2008


