Computer Corporation of America
|
Feedback
Search CCA:
   
USA CCA
CCA Products
CCA Customer Support
CCA Resources
CCA - Company
CCAPRINT: A Newsletter for Model 204® and System 1032® Users
April 10, 2005
     
Insight 204: Make Your Hotel Reservation for Insight 204!
System 1032: Estimating Key Table Sizes and Disk Access Printer-friendly version
16M Records - Is It Really a Limit? Printer-friendly version

Insight 204
USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USER’S SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.

Make Your Hotel Reservation for Insight 204!
By Marie Kelly
Director of Marketing

If you’re planning to attend the Insight 204 Symposium in Boston in June, be sure to make your hotel reservation in the next couple of weeks. The deadline for making your reservation at the guaranteed conference price of $195 is May 13th, less than a month away. So don’t delay! Visit the Insight 204 Lodging page today and contact the Hyatt as soon as possible. Be sure to tell them that you are attending the Computer Corporation of America event.

If you haven’t even REGISTERED for Insight yet, what are you waiting for? Registration is free! You should register as soon as possible, particularly if you plan to attend any of the training workshops being held on Wednesday, June 8th. Attendance at these workshops is limited and is on a “first come, first served” basis.

If you have any questions about Insight 204, please complete the Insight Feedback Form. We will get back to you quickly. We look forward to seeing you in Boston in June!


System 1032
USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USER’S SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.

Estimating Key Table Sizes and Disk Access
By Tym Stegner

Tym

You can estimate the amount of disk space that is required for System 1032 key tables. This article briefly describes the key table structures and lets you predict the approximate cost of building, accessing, and maintaining a given key table. These guidelines are based partly on analysis of the key table structures and partly on measurements conducted on typical database files. The results are only approximations, usually over-estimates.

This article is for the benefit of those users desiring a deeper understanding of the mechanics of key table generation, or the system administrator or database designer allocating space for key tables for a new or modified dataset.

Introducing Key Tables
This section briefly describes the structure of a System 1032 attribute index, called a key table, and explains the factors that affect the size of a key table. The adjustments that are under your control are also explained. It states the simplifying assumptions that are used in the formulas. System 1032 key tables use a modified binary-tree (B-tree) structure.

Lowest-Level Key Table Pages
From a logical point of view, a System 1032 key table is simply a sorted list of each value that occurs for a single attribute in a dataset; each value is followed by a list of record numbers ($ID) where that value occurs. Indeed, at the lowest level a key table is actually a series of value / record-list pairs. The size of the key table is largely determined by the space required for each value, plus the space required for a typical record number list, multiplied by the number of different values for the attribute, plus expansion space.

In addition to the space containing active information, you can use the NULL LOWER clause of the KEY_DEFAULTS command during dataset definition to specify the amount of null space to be left in each lowest-level key table page for new or modified entries. By default, System 1032 allows 15% null space per key table page.

Higher-Level Key Table Pages
System 1032 achieves its retrieval speed by locating a given key table value quickly. To do this, it creates higher-level key table pages, which index the smallest value that occurs in each lower-level key table page. Additional index levels are added as needed until a single root page is reached. This usually results in a tree structure with 20 to 50 (or more) branches at each level, so that an enormously large number of lowest-level key table pages can be efficiently indexed within two to four levels. Since the root page usually remains cached, minimal disk reads are often sufficient to locate any specific data value and its associated record list.

The NULL HIGHER clause of the KEY_DEFAULTS command controls empty space in higher-level pages.

Costs Vary Logarithmically
This high root-to-branch, or fan-out factor means that retrieval costs should be approximately proportional to the logarithm of the number of records in the dataset. The fan-out factor represents the ratio of the lower-level to the higher-level branches of the B-tree. For example, locating a record in a dataset of 1,000,000 records may take about twice as much CPU time and about twice the number of disk reads as locating a record in a dataset of only 1,000 records. Because of efficient disk caching, System 1032 often requires even less CPU time than this guideline would suggest.

Assumptions about Your Dataset
System 1032 compresses the record list information that is stored in key tables. This compression depends on the actual record numbers that are in a record list for a value, so that the space required for a record list for 100 consecutive records is less than the space required for 100 records scattered randomly throughout a large dataset.

The calculations assume that the value’s records are distributed uniformly in the dataset. Even so, the calculations used in this document may underestimate key table sizes for small datasets with less than 10,000 records.

Estimating Process for Key Table Size
You need to know:

v The total number of records in your dataset (NREC).
v The number of unique values for the selected attribute (NVAL).
v

Your null space percentage setting for lowest-level key pages (LNULL) and for higher-level key pages (HNULL).

The default value for each parameter is 15.

The command SHOW DATASET KEY_DEFAULTS will list the NULL values for the current dataset. If the higher and lower null values are the same, only one value will be shown.

v The number of bytes required to store an attribute value in the key table (VBYTE) based on the data type of the value, as shown in Table 1.

Table 1. Storage requirements for each data type

Data Type Value Size in Bytes
Date, Date_Time 8
Decimal Precision / (2 + 1)
Integer 4 or 8*
Logical 4
Real 4 or 8*
Text Attribute length (max 80)
Text Varying Substring length (max 80) or first 40 characters
Time, Time_Span 8

* 4 if single precision, 8 if double precision

Estimating Lowest-Level Key Table Page Size
The following describes the steps necessary to estimate the number of pages required for the lowest-level key table entries:

  1. Compute the average number of records per value (AREC) as the total number of records divided by the number of different values:
  2. Estimate the byte size of the record list (RBYTE) for the average number of records per value (AREC).
  3. Estimate the size of all value / record-list pairs (LBYTE) by multiplying the number of different values by the size of each value, plus the estimated size of each record list.
  4. Estimate the number of lowest-level key pages (LPAGE) by dividing the size of all value / record-list pairs (LBYTE) by the number of available bytes per lowest-level page.

Estimating Higher-Level Key Table Page Size
The following describes the steps to estimate the number of higher-level key table pages needed to index the lowest-level pages as follows:

  1. Compute the approximate fan-out factor at each higher level (HFAN) by dividing the available bytes per higher-level page by the size of each index entry.
  2. Iteratively estimate the number of higher-level key pages (HnPAGE) by dividing the number of lowest-level key pages by the fan-out factor (rounded up).
    The iterative process continues until the top-level key page is reached, having a value of 1.
  3. Estimate the total number of higher-level key pages (HPAGE) by summing the intermediate levels.

Estimating Key File Size
The total size of the key table file (.DMK) is estimated by adding the pages needed for lowest-level key information (LPAGE) to those needed for higher-level key indexes (HPAGE), plus some fixed overhead.

Example
Assume we have a dataset with 1,200,000 records and a keyed TEXT 10 attribute with approximately 30,000 different values, and with KEY_DEFAULTS set to 0% null space. Then:

NREC = 1,200,000 records
NVAL = 30,000 values
VBYTE = 10 bytes/value
LNULL = 0 percent null
HNULL = 0 percent null
Gives:
AREC = 40.0 records per value
RBYTE = 98 bytes per record list
LBYTE = 3,240,000 bytes at lowest level
LPAGE = 6,480 pages at lowest level
HFAN = 33 branches per index page
H1PAGE = 196 pages of first-level index
H2PAGE = 6 pages of second-level index
H3PAGE = 1 page of top-level index
HPAGE = 203 pages of higher-level indexing
DMK file = 6,683 blocks

Automation of Calculation
To generate the estimated key table sizes, I created a Microsoft Excel worksheet, KeyEst.XLS, to perform the calculations, depending on the necessary values specified. Righ click KeyEst.ZIP to download this worksheet.

To use, open the worksheet, enable macros, and enter values for:

v Number of dataset records
v Lower-level table null space percentage
v Higher-level table null space percentage
v Number of keyed attributes in the dataset
If more than one keyed attribute is to be calculated, advance the cursor out of the ATRN cell, and press the Attrs button toward the bottom of the page to generate additional attribute columns. The previously entered information is copied across for you. LNULL and HNULL can be further modified if you have different values set for different attributes.
v For each keyed attribute, enter:

Attribute length in bytes, according to Table 1
The number of unique values for the attribute
The worksheet automatically calculates each attribute’s DMK size, and also provides a running total of all DMK files.

Comparing Actual Results
Table 2 shows the results of some preliminary testing of this estimating procedure using actual datasets created using a random-number generator. Each line represents a single test, with estimated and actual results.

  1. The first column shows the number of records in the dataset.
  2. The second column shows the data type (all integers are single precision).
  3. The third column shows the number of potential different values from which random values were selected for each record. For any record, each value is equally likely; however, every possible value is not necessarily represented in a record. That is, the value for any given record is randomly selected with equal probability from the interval [1 : NVALS]. The actual number of occurrences per value can thus be expected to fall into a normal distribution, unless the number of possible values is large compared to the number of records in the dataset. In the latter situation, the actual number of different values is likely to be somewhat smaller than the potential number, which, of course, is always limited by the dataset size.
  4. The fourth column shows the estimated key table size in pages using the formulas given in this article.
  5. The fifth shows the actual size for each test.
  6. The sixth shows the percentage of over-estimation.

Table 2. Comparing calculated key table sizes with the actual key table size

No. of records Data type Potentialvalues Key table size estimated Actual %Difference
1,000 Integer 10 4 4 10.0%
1,000 Integer 100 6 7 -14.3%
1,000 Integer 1,000 14 13 7.7%
10,000 Integer 10 23 23 0.0%
10,000 Integer 100 35 36 -2.8%
10,000 Integer 1,000 56 66 -15.2%
100,000 Integer 10 219 219 0.0%
100,000 Integer 100 338 333 1.5%
100,000 Integer 1,000 463 457 1.3%
250,000 Integer 6 479 455 5.3%
250,000 Text 3 194 922 918 0.4%
250,000 Text 3 1,000 1,137 1,103 3.1%
250,000 Text 6 1,000,000 5,145 5,097 0.9%
500,000 Integer 6 957 909 5.3%
500,000 Text 3 194 1,841 1,830 0.6%
500,000 Text 3 1,000 2,264 2,180 3.9%
500,000 Text 6 1,000,000 11,574 9,567 21.0%
1,000,000 Integer 6 1,911 1,817 5.2%
1,000,000 Text 3 194 3,680 3,652 0.8%
1,000,000 Text 3 1,000 4,520 4,336 4.2%
1,000,000 Text 6 1,000,000 23,146 17,338 33.5%

Actual data will not have these idealized distributions and will often require less space than estimated.


Model 204
USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USER’S SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.

16M Records - Is It Really a Limit?
By James Damon



Yes, it is a physical limit. Sixteen million is the maximum number of records that you can store in one, physical Model 204 file. However, since you can combine multiple physical files to form a logical-file group, a new record limit emerges. A file group can be created using up to 256 files. This results in a collection of records, accessible via a single, logical name, approaching 4.3 billion. Many customers have logical databases ranging from 20 million to 100 million records and make extensive use of file groups.

Files defined as part of a group are each treated as a separate entity. Dissimilar files may be defined to be part of a group. That is, there are no requirements that field names, attributes, file parameters or other characteristics be the same between files in a group. However, files are generally similar and often identical, in terms of characteristics, when defined as part of a group. This is especially true concerning performance. You can achieve optimal group performance when you adhere to the following guidelines as closely as possible:

  1. Field names and attributes should be nearly identical across all files.
  2. The data should, when possible, be partitioned by the field that will be the most frequently used in FIND statements

Partitioning is discussed below.

Creating a File Group
Although I discuss only permanent groups in this article, most of what is true for permanent groups is also true for temporary and ad-hoc groups. The following table is a simple example of an instance where you might create a file group that involves census data on 100 million households. Since 100 million records do not fit into a single, 16-million record Model 204 file, you must divide or partition the data into files that you could then include in a file group. Let’s say that zipcode will be one part of the key and that the values of zipcodes range from 00000 to 99999. So you could partition the records into files according to the ranges of zipcodes.

Filename Zip code range Number of records
ZIP00 00000-09999 12,000,000
ZIP10 10000-19999 7,000,000
ZIP20 20000-29999 4,500,000
ZIP30 And so on... And so on...

In this case, ZIPCODE values provide a convenient partitioning scheme. You can sort these records by zipcode and load ten files, each with a particular zipcode range of records. This will be a property of the data that will be very handy as you will see later.

Using the CREATE Command
To create a file group, you would issue the following command:

CREATE PERM GROUP CENSUS05 FROM - 
ZIP00,ZIP10,ZIP20,ZIP30,ZIP40,ZIP50,ZIP60,ZIP70,ZIP80,ZIP90 
END CREATE

You can reference the ten physical Model 204 files individually as ZIP00, ZIP10, ZIP20, and so on, or you can reference them as a logical collection of records called CENSUS05.

CCA also recommends that all files in the group be defined with

  1. The same values for FRCOVPT (usually X’C0’)
  2. TBO enabled

Physically, What Is a Group?
A Model 204 group is a logical construct only. There is no physical data structure that exists anywhere in Model 204, other than the description of the group in CCAGRP and the Group Field Table (GFT). The GFT is built when a group is opened and deleted when the group is closed and therefore, has no permanence. When field names are identical across files, you can delete group definitions and recreate them with different files, without changing any existing applications. The CENSUS05 group now looks like this:

Since ZIPCODE will be the most commonly used field in FIND statements, it should be defined with the ORDERED attribute and should be the same field name in all files.

Each file is separate and distinct from every other file. You can access each as a single Model 204 file, and you can dump and restore each independently of the others. Each file may be comprised of a single dataset or multiple datasets. Some applications may be accessing a particular file indvidually, for read or update, while other applications may be accessing it as a member of the group.

Improving Group Performance
Depending on the nature of typical queries against the data and the partitioning scheme you used to populate the files, processing a file group can provide variable performance. In this group, if you want to find records where ZIPCODE is between 80300 and 80399 and number of children equals four, you could write the following request:

BEGIN
Z8: IN GROUP CENSUS05 -
FPC ZIPCODE IS LIKE ‘803*’
KIDS=4
END

The previous query will be a time consuming request, because each file in the group will be searched to create the found set. Even though ZIPCODE and KIDS are ORDERED, the find will still be expensive because it must be evaluated against each file in the group. Here’s where your partitioning scheme comes in handy. You know that zipcodes in the range 80000 to 89999 are in the ZIP80 file so you do not really need to run the find against all the files in the group. Instead you can use the IN GROUP X MEMBER %MEM syntax.

BEGIN
%ZIP = ‘803*’
IF $SUBSTR(%ZIP,1,1) = ‘8’ THEN
%MEM = ‘ZIP80’
END IF
Z8: IN GROUP CENSUS05 MEMBER %MEM -
FPC ZIPCODE IS LIKE %ZIP
KIDS=4
END

Now, instead of searching the Ordered Index in ten very large files, you search the Ordered Index in ZIP80 only. The performance benefits of using the IN GROUP X MEMBER %MEM syntax are enormous and provide the only reasonable approach to use when processing large file groups.

No Partitioning Scheme Is Perfect
Of course, there will be cases when, even though you have partitioned the data in this fashion, the partioning scheme will not provide a performance benefit. If you run the following request, all files must be searched because records satisfying the find criteria could be found in any file.

BEGIN
Z8: IN GROUP CENSUS05 –
FPC LASTNAME IS LIKE ‘JOHN*’
END

Summary
File groups provide the perfect facility for overcoming the 16-million record limit and for making the management of large numbers of records and large amounts of data more feasible. However, it is important to consider what the primary data access path will be--by zipcode, part number, order number--and then partition the data and load files according to that partitioning scheme. While some types of access will not provide optimal performance regardless of partitioning scheme, if the predominant access path conforms to your partitioning scheme, you will be able to provide superior performance to most applications.


Contact CCA Webmaster
Copyright 2008