Insight 204 USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USERS SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.
Make Your Hotel Reservation for Insight 204! By Marie Kelly Director of Marketing
If youre planning to attend the Insight 204 Symposium in Boston in June, be sure to make your hotel reservation in the next couple of weeks. The deadline for making your reservation at the guaranteed conference price of $195 is May 13th, less than a month away. So dont delay! Visit the Insight 204 Lodging page today and contact the Hyatt as soon as possible. Be sure to tell them that you are attending the Computer Corporation of America event.
If you havent even REGISTERED for Insight yet, what are you waiting for? Registration is free! You should register as soon as possible, particularly if you plan to attend any of the training workshops being held on Wednesday, June 8th. Attendance at these workshops is limited and is on a first come, first served basis.
If you have any questions about Insight 204, please complete the Insight Feedback Form. We will get back to you quickly. We look forward to seeing you in Boston in June!
System 1032 USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USERS SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.
Estimating Key Table Sizes and Disk Access By Tym Stegner
You can estimate the amount of disk space that is required for System 1032 key tables. This article briefly describes the key table structures and lets you predict the approximate cost of building, accessing, and maintaining a given key table. These guidelines are based partly on analysis of the key table structures and partly on measurements conducted on typical database files. The results are only approximations, usually over-estimates.
This article is for the benefit of those users desiring a deeper understanding of the mechanics of key table generation, or the system administrator or database designer allocating space for key tables for a new or modified dataset.
Introducing Key Tables This section briefly describes the structure of a System 1032 attribute index, called a key table, and explains the factors that affect the size of a key table. The adjustments that are under your control are also explained. It states the simplifying assumptions that are used in the formulas. System 1032 key tables use a modified binary-tree (B-tree) structure.
Lowest-Level Key Table Pages From a logical point of view, a System 1032 key table is simply a sorted list of each value that occurs for a single attribute in a dataset; each value is followed by a list of record numbers ($ID) where that value occurs. Indeed, at the lowest level a key table is actually a series of value / record-list pairs. The size of the key table is largely determined by the space required for each value, plus the space required for a typical record number list, multiplied by the number of different values for the attribute, plus expansion space.
In addition to the space containing active information, you can use the NULL LOWER clause of the KEY_DEFAULTS command during dataset definition to specify the amount of null space to be left in each lowest-level key table page for new or modified entries. By default, System 1032 allows 15% null space per key table page.
Higher-Level Key Table Pages System 1032 achieves its retrieval speed by locating a given key table value quickly. To do this, it creates higher-level key table pages, which index the smallest value that occurs in each lower-level key table page. Additional index levels are added as needed until a single root page is reached. This usually results in a tree structure with 20 to 50 (or more) branches at each level, so that an enormously large number of lowest-level key table pages can be efficiently indexed within two to four levels. Since the root page usually remains cached, minimal disk reads are often sufficient to locate any specific data value and its associated record list.
The NULL HIGHER clause of the KEY_DEFAULTS command controls empty space in higher-level pages.
Costs Vary Logarithmically This high root-to-branch, or fan-out factor means that retrieval costs should be approximately proportional to the logarithm of the number of records in the dataset. The fan-out factor represents the ratio of the lower-level to the higher-level branches of the B-tree. For example, locating a record in a dataset of 1,000,000 records may take about twice as much CPU time and about twice the number of disk reads as locating a record in a dataset of only 1,000 records. Because of efficient disk caching, System 1032 often requires even less CPU time than this guideline would suggest.
Assumptions about Your Dataset System 1032 compresses the record list information that is stored in key tables. This compression depends on the actual record numbers that are in a record list for a value, so that the space required for a record list for 100 consecutive records is less than the space required for 100 records scattered randomly throughout a large dataset.
The calculations assume that the values records are distributed uniformly in the dataset. Even so, the calculations used in this document may underestimate key table sizes for small datasets with less than 10,000 records.
Estimating Process for Key Table Size You need to know:
Your null space percentage setting for lowest-level key pages (LNULL) and for higher-level key pages (HNULL). The default value for each parameter is 15. The command SHOW DATASET KEY_DEFAULTS will list the NULL values for the current dataset. If the higher and lower null values are the same, only one value will be shown.
Table 1. Storage requirements for each data type
* 4 if single precision, 8 if double precision
Estimating Lowest-Level Key Table Page Size The following describes the steps necessary to estimate the number of pages required for the lowest-level key table entries:
Estimating Higher-Level Key Table Page Size The following describes the steps to estimate the number of higher-level key table pages needed to index the lowest-level pages as follows:
Estimating Key File Size The total size of the key table file (.DMK) is estimated by adding the pages needed for lowest-level key information (LPAGE) to those needed for higher-level key indexes (HPAGE), plus some fixed overhead.
Example Assume we have a dataset with 1,200,000 records and a keyed TEXT 10 attribute with approximately 30,000 different values, and with KEY_DEFAULTS set to 0% null space. Then:
NREC = 1,200,000 recordsNVAL = 30,000 valuesVBYTE = 10 bytes/valueLNULL = 0 percent nullHNULL = 0 percent nullGives:AREC = 40.0 records per valueRBYTE = 98 bytes per record listLBYTE = 3,240,000 bytes at lowest levelLPAGE = 6,480 pages at lowest levelHFAN = 33 branches per index pageH1PAGE = 196 pages of first-level indexH2PAGE = 6 pages of second-level indexH3PAGE = 1 page of top-level indexHPAGE = 203 pages of higher-level indexingDMK file = 6,683 blocks
Automation of Calculation To generate the estimated key table sizes, I created a Microsoft Excel worksheet, KeyEst.XLS, to perform the calculations, depending on the necessary values specified. Righ click KeyEst.ZIP to download this worksheet.
To use, open the worksheet, enable macros, and enter values for:
Comparing Actual Results Table 2 shows the results of some preliminary testing of this estimating procedure using actual datasets created using a random-number generator. Each line represents a single test, with estimated and actual results.
Table 2. Comparing calculated key table sizes with the actual key table size
Actual data will not have these idealized distributions and will often require less space than estimated.
Model 204 USE OF AND ACCESS TO PRODUCTS AND FEATURES ARE IN ACCORDANCE WITH THE TERMS AND CONDITIONS OF THE USERS SOFTWARE LICENSE. THE PRESENTATION OF MATERIAL HEREIN DOES NOT, IN ANY MANNER, MODIFY SUCH TERMS AND CONDITIONS.
16M Records - Is It Really a Limit? By James Damon
Yes, it is a physical limit. Sixteen million is the maximum number of records that you can store in one, physical Model 204 file. However, since you can combine multiple physical files to form a logical-file group, a new record limit emerges. A file group can be created using up to 256 files. This results in a collection of records, accessible via a single, logical name, approaching 4.3 billion. Many customers have logical databases ranging from 20 million to 100 million records and make extensive use of file groups.
Files defined as part of a group are each treated as a separate entity. Dissimilar files may be defined to be part of a group. That is, there are no requirements that field names, attributes, file parameters or other characteristics be the same between files in a group. However, files are generally similar and often identical, in terms of characteristics, when defined as part of a group. This is especially true concerning performance. You can achieve optimal group performance when you adhere to the following guidelines as closely as possible:
Partitioning is discussed below.
Creating a File Group Although I discuss only permanent groups in this article, most of what is true for permanent groups is also true for temporary and ad-hoc groups. The following table is a simple example of an instance where you might create a file group that involves census data on 100 million households. Since 100 million records do not fit into a single, 16-million record Model 204 file, you must divide or partition the data into files that you could then include in a file group. Lets say that zipcode will be one part of the key and that the values of zipcodes range from 00000 to 99999. So you could partition the records into files according to the ranges of zipcodes.
In this case, ZIPCODE values provide a convenient partitioning scheme. You can sort these records by zipcode and load ten files, each with a particular zipcode range of records. This will be a property of the data that will be very handy as you will see later.
Using the CREATE Command To create a file group, you would issue the following command:
CREATE PERM GROUP CENSUS05 FROM - ZIP00,ZIP10,ZIP20,ZIP30,ZIP40,ZIP50,ZIP60,ZIP70,ZIP80,ZIP90 END CREATE
You can reference the ten physical Model 204 files individually as ZIP00, ZIP10, ZIP20, and so on, or you can reference them as a logical collection of records called CENSUS05.
CCA also recommends that all files in the group be defined with
Physically, What Is a Group? A Model 204 group is a logical construct only. There is no physical data structure that exists anywhere in Model 204, other than the description of the group in CCAGRP and the Group Field Table (GFT). The GFT is built when a group is opened and deleted when the group is closed and therefore, has no permanence. When field names are identical across files, you can delete group definitions and recreate them with different files, without changing any existing applications. The CENSUS05 group now looks like this:
Since ZIPCODE will be the most commonly used field in FIND statements, it should be defined with the ORDERED attribute and should be the same field name in all files.
Each file is separate and distinct from every other file. You can access each as a single Model 204 file, and you can dump and restore each independently of the others. Each file may be comprised of a single dataset or multiple datasets. Some applications may be accessing a particular file indvidually, for read or update, while other applications may be accessing it as a member of the group.
Improving Group Performance Depending on the nature of typical queries against the data and the partitioning scheme you used to populate the files, processing a file group can provide variable performance. In this group, if you want to find records where ZIPCODE is between 80300 and 80399 and number of children equals four, you could write the following request:
BEGINZ8: IN GROUP CENSUS05 -FPC ZIPCODE IS LIKE 803* KIDS=4END
The previous query will be a time consuming request, because each file in the group will be searched to create the found set. Even though ZIPCODE and KIDS are ORDERED, the find will still be expensive because it must be evaluated against each file in the group. Heres where your partitioning scheme comes in handy. You know that zipcodes in the range 80000 to 89999 are in the ZIP80 file so you do not really need to run the find against all the files in the group. Instead you can use the IN GROUP X MEMBER %MEM syntax.
BEGIN%ZIP = 803*IF $SUBSTR(%ZIP,1,1) = 8 THEN %MEM = ZIP80END IFZ8: IN GROUP CENSUS05 MEMBER %MEM -FPC ZIPCODE IS LIKE %ZIP KIDS=4END
Now, instead of searching the Ordered Index in ten very large files, you search the Ordered Index in ZIP80 only. The performance benefits of using the IN GROUP X MEMBER %MEM syntax are enormous and provide the only reasonable approach to use when processing large file groups.
No Partitioning Scheme Is Perfect Of course, there will be cases when, even though you have partitioned the data in this fashion, the partioning scheme will not provide a performance benefit. If you run the following request, all files must be searched because records satisfying the find criteria could be found in any file.
BEGINZ8: IN GROUP CENSUS05 FPC LASTNAME IS LIKE JOHN*END
Summary File groups provide the perfect facility for overcoming the 16-million record limit and for making the management of large numbers of records and large amounts of data more feasible. However, it is important to consider what the primary data access path will be--by zipcode, part number, order number--and then partition the data and load files according to that partitioning scheme. While some types of access will not provide optimal performance regardless of partitioning scheme, if the predominant access path conforms to your partitioning scheme, you will be able to provide superior performance to most applications.