Model 204
PRESENT or NOT PRESENT Using the Ordered Index, Part I: Ordered Character fields
By James Damon
Is a Field There or Not?
There are many reasons why a DBA, application programmer, or end-user may need to know whether a field is present or absent in a set of records. Fields most often go missing during file loads or file reorganizations due to errors in the data being loaded or errors in the logic of the load program itself. Problems can also occur during the course of normal file updates, if the logic necessary to prevent missing fields is not included in all updating programs. When missing data is detected, the full extent of the problem needs to be known: how many and exactly which records are missing the data, or how many records are correct.
Sometimes missing data is not an error condition, but is expected, and you might want a report of records with (or without) a missing field. For example, not all families have young children. Thus, the CHILD field might be missing in some records in a FAMILIES database.
Comparing Ways of Finding Missing Fields
As most User Language programmers know the IS [NOT] PRESENT syntax in a FIND statement or IF clause provides a powerful, yet simple and easy way to detect whether a field is present or absent in a set of records. For example, the following FIND statement can answer the question: Which records and how many are missing the field RECTYPE?
ALL: FIND AND PRINT COUNT RECTYPE IS NOT PRESENT
However, such a statement can be expensive in a large file, since it entails a direct record search of every record. In other words, every Table B record must be examined and scanned from beginning to end looking for an occurrence of the field RECTYPE.
In my small, nearly one-million record file, DSNLIST3, the elapsed time, I/O, and CPU costs were high when using the IS NOT PRESENT syntax compared to using a wildcard and ORDERED CHARACTER search against the Ordered Index.
Figure 1. Using the IS NOT PRESENT syntax
BEGIN FIND AND PRINT COUNT RECTYPE IS NOT PRESENT END *** M204.0179: TABLE B SEARCH IMPLIED FOR FIELD=RECTYPE IN FILE DSNLIST3 126 T REQUEST CPU=12.245 CNCT=8511 DKRD=40542 DKWR=47 SQRD=8 SQWR=35 NTBL=1 QTBL=16 TTBL=3 VTBL=5 PDL=688 CNCT=182 CPU=12232 DKRD=40536 DKWR=45 OUT=3 IN=1 FINDS=1 PCPU=32 RQTM=181789 DIRRCD=999847 DKPR=2040012
The Figure 1 request:
Now, compare the results in Figure 2 using an ORDERED CHARACTER and wildcard pattern match, which required a fraction of the resources.
Figure 2. Using wildcard syntax and ORDERED CHARACTER to determine a field is not present
BEGIN ALL: FIND AND PRINT COUNT END FIND * ISPRES: FIND AND PRINT COUNT RECTYPE IS LIKE * END FIND * PLACE RECORDS IN ALL ON LIST NOTPRES REMOVE RECORDS IN ISPRES FROM LIST NOTPRES * C: COUNT RECORDS ON LIST NOTPRES * PRINT 'FIELD IS NOT PRESENT COUNT: ' WITH COUNT IN C END T REQUEST
999847 999721 FIELD IS NOT PRESENT COUNT: 126 CPU=13.147 CNCT=1569 DKRD=40644 DKWR=45 SQRD=26 SQWR=82 NTBL=4 QTBL=37 STBL=31 TTBL=3 VTBL=9 PDL=964 CNCT=1 CPU=19 DKRD=84 OUT=5 FINDS=2 PCPU=48 RQTM=394 BXNEXT=14 BXFIND=2 DKPR=281
In Table 1, a quick comparison of the statistics generated in Figures 1 and 2 reveals the savings, roughly two to three orders of magnitude, when the search is performed using the Ordered Index instead of a FIND statement with the IS NOT PRESENT condition.
Table 1. Comparing an IS NOT PRESENT to an ORDERED-CHARACTER-wildcard search
Search strategy
CNCT (secs)
CPU (ms)
DKRD
DIRRCD
IS NOT PRESENT syntax
182
12,232
40,536
999,847
Ordered character field, plus wildcard
1
19
84
0
Comparing Ways to Find Records Where a Field Is Present
Conversely, asking the question using the IS PRESENT syntax is equally costly, as shown by comparing Figures 3 and 4.
Figure 3. Using the IS PRESENT syntax
BEGIN FIND AND PRINT COUNT RECTYPE IS PRESENT END *** M204.0179: TABLE B SEARCH IMPLIED FOR FIELD = RECTYPE IN FILE DSNLIST3
999721 CPU=13.395 CNCT=2354 DKRD=40728 DKWR=69 SQRD=23 SQWR=84 NTBL=1 QTBL=16 TTBL=3 VTBL=5 PDL=672 CNCT=99 CPU=12035 DKRD=40515 DKWR=56 OUT=13 FINDS=1 PCPU=120 RQTM=99225 DIRRCD=999847 DKPR=2040013
However, using an ORDERED CHARACTER field and wildcard syntax as shown in Figure 4, again significantly reduces the resources consumed.
Figure 4. Using wildcard syntax for an ORDERED CHARACTER field
03.006 JAN 06 18.30.13 PAGE 3 BEGIN FIND AND PRINT COUNT RECTYPE IS LIKE * END
999721 T REQUEST CPU=0.052 CNCT=876 DKRD=186 DKWR=25 SQRD=14 SQWR=60 NTBL=1 QTBL=13 STBL=2 TTBL=3 VTBL=8 PDL=948 CNCT=4 CPU=13 DKRD=88 OUT=3 IN=1 FINDS=1 PCPU=4 RQTM=4436 BXNEXT=14 BXFIND=2 DKPR=176
The Figure 4 request, which also yields the result of 999,721 records:
In other words, a reduction again of two to three orders of magnitude in resources consumed, as shown in Table 2.
Table 2. Comparing an IS PRESENT to an ORDERED-CHARACTER-wildcard search
CPU(ms)
IS PRESENT syntax
99
12,035
40,515
4
13
88
Increasing Costs with Increasing Number of Unique Values
In Figures 1 through 4, the field RECTYPE has only two values, A and B, so the number of Table D Ordered Index pages, which must be scanned looking for those values and associated record lists or bit patterns, is naturally small. As you might expect, as the number of unique values for a particular field increases, so does the cost of using the Ordered Index in this fashion. However, although the costs are higher, they are still significantly lower than the cost of an entire Table B search.
This is illustrated in Figures 5, 6, and 7. The field ALPHA1, defined as ORDERED CHARACTER OCCURS 1 LENGTH 30, is in each of the 999,847 records in file DSNLIST3 and each value is unique. Figure 5 shows an ANALYZE ALPHA1 command and output, which includes the statistic LEAF=3074, the count of Table D pages required to store all the unique values in file DSNLIST3.
Figure 5. Using an ANALYZE command to determine the number of Table D pages
03.010 JAN 10 09.51.16 PAGE 1
ANALYZE ALPHA1 ROOT NODE VERSION NUMBER = 2557375 *** M204.0005: ANALYZE FIELDNAME = ALPHA1 AVG. OFFSET COMP. KEY PAGE AVG. PAGES ENTRY AREA SIZE AREA USAGE% UNUSED ROOT 1 19 58 197 4 5889 I-NODE 12 265 550 2981 57 2612 LEAF 3074 325 670 4 2349 49 3120
MRIB: IMMEDIATE LIST BITMAP TOTAL ENTRIES RECORDS PAGES SRIB: 999847 *** M204.0003: ANALYZE DONE
Now, let's compare the statistics in Figure 6, which illustrates an ORDERED-CHARACTER-wildcard search to find all records in which ALPHA1 is present with the statistics in Figure 7, which illustrates a FIND statement using the IS PRESENT condition.
Figure 6. Using widcard syntax for field ALPHA1
IN DSNLIST3 BEGIN ALL: FIND AND PRINT COUNT ALPHA1 IS LIKE * END T REQUEST
999847 CPU=160.660 CNCT=855 DKRD=49684 DKWR=53552 SQRD=47 SQWR=240 NTBL=1 QTBL=13 STBL=2 TTBL=3 VTBL=8 PDL=980 CNCT=44 CPU=9868 DKRD=3100 OUT=3 FINDS=1 PCPU=222 RQTM=44222 BXNEXT=4019386 BXFIND=2 BXRFND=19996 DKPR=1026039
Figure 7. Using IS PRESENT syntax for field ALPHA1
IN DSNLIST3 BEGIN ALL: FIND AND PRINT COUNT ALPHA1 IS PRESENT END T REQUEST
999847 CPU=173.863 CNCT=1455 DKRD=90224 DKWR=53602 SQRD=60 SQWR=280 NTBL=1 QTBL=16 TTBL=3 VTBL=5 PDL=672 CNCT=141 CPU=13187 DKRD=40531 DKWR=47 OUT=3 FINDS=1 PCPU=92 RQTM=141059 DIRRCD=999847 DKPR=2040263
Figure 6 indicates that 3100 physical reads were required to perform the ORDERED-CHARACTER-wildcard search. We can see from the LEAF=3074 statistic in Figure 5 that most of the physical reads were Table D pages. Since each value of ALPHA1 is unique, each value entry in the Ordered Index contains the absolute record number of the record containing that value. This eliminates the need for any additional I/O to read other Table D pages to collect record lists or bitmaps to construct the found set.
When you compare the CNCT, CPU and DKRD statistics in Figures 6 and 7, you will note that even in this extreme case of so many unique values, the savings obtained by using the Ordered Index to mimic IS PRESENT, justifies this approach.
In Summary
If you need to know whether a particular field is or is not present in a large set of records, define that field as ORDERED CHARACTER and use the facilities of the Ordered Index as illustrated. This is far preferable to using the IS [NOT] PRESENT condition in a FIND statement. The computing resources consumed are significantly reduced when you use the Ordered Index to implement what is tantamount to an index for the IS [NOT] PRESENT condition. Also, since this type of search is much faster, conflicts for critical file resources are reduced resulting in an overall improvement in system efficiency and end-user response times.
Coming Attractions
Determining the presence or absence of an ORDERED NUMERIC field requires a different approach. For ORDERED NUMERIC fields, the Ordered Index can still be used. I will discuss this approach in the next issue of CCAprint.
Copyright © 2008 Computer Corporation of America. All right reserved. Published in the United States of America.
Contact CCA Webmaster Copyright 2008