IEEE P1003.2 Draft 11.2 - September 1991 Copyright (c) 1991 by the Institute of Electrical and Electronics Engineers, Inc. 345 East 47th Street New York, NY 10017, USA All rights reserved as an unpublished work. This is an unapproved and unpublished IEEE Standards Draft, subject to change. The publication, distribution, or copying of this draft, as well as all derivative works based on this draft, is expressly prohibited except as set forth below. Permission is hereby granted for IEEE Standards Committee participants to reproduce this document for purposes of IEEE standardization activities only, and subject to the restrictions contained herein. Permission is hereby also granted for member bodies and technical committees of ISO and IEC to reproduce this document for purposes of developing a national position, subject to the restrictions contained herein. Permission is hereby also granted to the preceding entities to make limited copies of this document in an electronic form only for the stated activities. The following restrictions apply to reproducing or transmitting the document in any form: 1) all copies or portions thereof must identify the document's IEEE project number and draft number, and must be accompanied by this entire notice in a prominent location; 2) no portion of this document may be redistributed in any modified or abridged form without the prior approval of the IEEE Standards Department. Other entities seeking permission to reproduce this document, or any portion thereof, for standardization or other activities, must contact the IEEE Standards Department for the appropriate license. Use of information contained in this unapproved draft is at your own risk. IEEE Standards Department Copyright and Permissions 445 Hoes Lane, P.O. 
Box 1331 Piscataway, NJ 08855-1331, USA +1 (908) 562-3800 +1 (908) 562-1571 [FAX]

P1003.2 Draft 11.2
ISO/IEC CD 9945-2.2

STANDARDS PROJECT

Draft Standard for Information Technology --
Portable Operating System Interface (POSIX)
Part 2: Shell and Utilities

Sponsor
Technical Committee on Operating Systems and Application Environments
of the IEEE Computer Society

Work Item Number: JTC 1.22.21.2

Abstract: ISO/IEC 9945-2: 199x (IEEE Std 1003.2-199x) is part of the POSIX series of standards for applications and user interfaces to open systems. It defines the applications interface to a shell command language and a set of utility programs for complex data manipulation.

Keywords: API, application portability, data processing, open systems, operating system, portable application, POSIX, shell and utilities

P1003.2 / D11.2
September 1991

Copyright (c) 1991 by the Institute of Electrical and Electronics Engineers, Inc.
345 East 47th Street
New York, NY 10017, USA
All rights reserved.

This is an unapproved IEEE Standards Draft, subject to change. Permission is hereby granted for IEEE Standards Committee participants to reproduce this document for purposes of IEEE standardization activities. Permission is also granted for member bodies and technical committees of ISO and IEC to reproduce this document for purposes of developing a national position. Other entities seeking permission to reproduce this document for standardization or other activities, or to reproduce portions of this document for these or other uses, must contact the IEEE Standards Department for the appropriate license. Use of information contained in this unapproved draft is at your own risk.

IEEE Standards Department
Copyright and Permissions
445 Hoes Lane, P.O. Box 1331
Piscataway, NJ 08855-1331, USA
+1 (908) 562-3800
+1 (908) 562-1571 [FAX]

September 1991          SH XXXXX

BEGIN_RATIONALE

Editor's Notes

The IEEE ballot for Draft 11.2 is due at the IEEE Standards Office on 21 October 1991. You are also asked to e-mail any balloting comments to me: hlj@posix.com. Please read the balloting instructions in Annex G.

This document is also registered as ISO/IEC CD 9945-2.2. The international balloting period is unrelated to the IEEE balloting. Member bodies, please consult any accompanying materials from SC22. Also, please read the remainder of these Editor Notes to see explanations of stylistic differences between a draft and the final standard (copyright notices, inline rationale, etc.).

The IEEE balloting will be on hiatus during the international balloting period, which is probably scheduled to complete at the May 1992 WG15 meeting. This is in accordance with the WG15 Synchronization Plan, which calls for coordinated balloting to result in the approval of an IEEE/ANSI standard that is identical to the ISO/IEC Draft International Standard (DIS). There will be a final recirculation of a full draft (12) to the IEEE balloting group before it is sent to the Standards Board.
This section will not appear in the final document. It is used for editorial comments concerning this draft. Draft 11.2 is the fifth recirculation of the balloting process that began in December 1988 with Draft 8. Please consult Annex G and the cover letter for the ballot that accompanied this draft for information on how the recirculation is accomplished.

This draft uses small numbers in the right margin in lieu of change bars. ``2'' denotes changes from Draft 11.1 to Draft 11.2. ``1'' denotes changes from Draft 11 to Draft 11.1. All diff-marks prior to Draft 11.1 have been removed. Trivial informative (i.e., non-normative) changes and purely editorial changes such as grammar, spelling, or cross references are not diff-marked.

There are two versions of Draft 11.2 in circulation. The full printed version was sent for SC22 balloting and is also available from the IEEE for a duplication fee [call (800) 678-IEEE or +1 (908) 981-1393 outside the US]. The version sent to the IEEE balloting group consists (mostly) of pages containing normative changes. This was done to focus balloting group attention on the changes being balloted and to reduce costs and administrative time. The changes-only version contains a few handwritten pointers in the margins to show context where it would not be obvious; numbers near the normal page numbers show what the corresponding Draft 11 page number would be.

Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change.

The following minor global changes have been made without diff-marks:

   - Instances of the verbs ``print,'' ``report,'' ``display,'' ``issue,'' and ``list'' are being changed to ``write'' as part of a general cleanup related to the UPE, where ``write'' and ``display'' have precise meanings. This cleanup is probably not complete and will continue throughout ballot resolution and the final editing process.

ISO and IEEE have tightened up the requirements for the use of ``shall.'' We have been directed that all sentences that are currently declarative must be changed to use the ``shall'' form if they pose a requirement: ``The status is zero'' -> ``The status shall be zero.'' One specific instance of this was changing ``The following options/operands are available'' to ``The following options/operands shall be supported by the implementation.'' Another: ``The foo utility follows the utility argument syntax standard described in 2.11.2'' to ``The foo utility shall conform to the utility argument syntax guidelines described in 2.10.2.'' It is a tedious process to do all these translations and they are not complete. They will be completed on a draft-by-draft basis. In the meantime, please assume that all declarative sentences mean to use ``shall'' and treat them as either implementation or application requirements unless they specifically say ``may,'' ``should,'' or ``can.''

The rationale text for all the sections has been temporarily moved from Annex E and interspersed with the appropriate sections. The rationale sections are identified with the phrase ``(This subclause is not a part of P1003.2)'' in the heading. This colocation of rationale with its accompanying text was done to encourage the Technical Reviewers to maintain the rationale text, as well as provide explanations to the reviewers and balloters.

Not all of the Rationale sections have contents as of this draft. The empty sections may be partially distracting, but we feel it is imperative to keep them there to encourage the Technical Reviewers to provide rationale as needed.
Please report typographical errors to:

     Hal Jespersen
     POSIX Software Group
     447 Lakeview Way
     Redwood City, CA 94062
     +1 (415) 364-3410
     FAX: +1 (415) 364-4498
     Email: hlj@Posix.COM
     (Electronic mail is preferred.)

The copying and distribution of IEEE balloting drafts is accomplished by the Standards Office. To report problems with reproduction of your copy, contact:

     Anna Kaczmarek
     IEEE Standards Office
     P.O. Box 1331
     445 Hoes Lane
     Piscataway, NJ 08855-1331
     +1 (908) 562-3811
     FAX: +1 (908) 562-1571

Additional copies of this draft are available for a duplication and mailing fee. Contact:

     IEEE Publications
     1 (800) 678-IEEE
     +1 (908) 981-1393 [outside US]

This draft is available in various electronic forms to assist the review process. Our thanks to Andrew Hume of AT&T Bell Laboratories for providing online access facilities. Note that this is a limited experiment in providing online access; future ballots may provide other forms, such as diskettes or a bulletin board arrangement, but the instructions shown here are the only methods currently available. Please also observe the additional copyright restrictions that are described in the online files.

Assuming you have access to the Internet, the scenario is approximately

     ftp research.att.com    # research's IP address is 192.20.225.2
     cd posix/p1003.2/d11.2
     get toc index
     binary
     get p11-20.Z

The draft is available in several forms. The table of contents can be found in toc, pages containing a particular section are stored under the section number, sets of pages are stored in files with names of the form pn-m (for example, p11-20), and the entire draft is stored in all. By default, files are ASCII. A .ps suffix indicates PostScript. A .Z suffix indicates a compress'ed file. The file index contains a general description of the files available.
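
The naming conventions in the paragraph above can be expressed as a small classifier. The following Python sketch is only an illustration: the function names describe() and needs_binary_mode() are hypothetical and are not part of the draft or of the online service, and the handling of a name carrying both suffixes (.Z treated as outermost) is an assumption, since the text does not say how the suffixes combine.

```python
import re

# Hypothetical helper, not part of the draft or the online service: it only
# encodes the file naming conventions described in the paragraph above.
PAGE_SET = re.compile(r"p(\d+)-(\d+)")


def describe(name):
    """Classify a file name from the draft directory by its form and suffixes."""
    info = {"kind": "section", "format": "ascii", "compressed": False, "pages": None}
    # Assumption: when both suffixes are present, .Z is the outermost one.
    if name.endswith(".Z"):
        info["compressed"] = True
        name = name[:-2]
    if name.endswith(".ps"):
        info["format"] = "postscript"
        name = name[:-3]
    match = PAGE_SET.fullmatch(name)
    if match:
        # Names of the form pn-m hold a set of pages, n through m.
        info["kind"] = "pages"
        info["pages"] = (int(match.group(1)), int(match.group(2)))
    elif name in ("toc", "index", "all"):
        info["kind"] = name
    # Anything else is taken to be a section number such as 3.4.
    return info


def needs_binary_mode(name):
    """A compress'ed file is not ASCII, so it should be fetched in binary mode."""
    return describe(name)["compressed"]
```

Under these assumptions, describe("p11-20.Z") reports a compressed ASCII page set covering pages 11 through 20, which matches the ftp session above switching to binary before fetching that file.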
These files are also available via electronic mail by sending a message like

     send 3.4 3.5 9.2 from posix/p1003.2/d11.2

to netlib@research.att.com. If you use email, you should not ask for the compressed version. For a more complete introduction to this form of netlib, send the message

     send help

POSIX.2 Change History

This section is provided to track major changes between drafts. Since it was first added in Draft 11, earlier entries omit some degree of detail.

Draft 11.2  [September 1991] Sixth IEEE ballot (fifth recirculation; only changed pages distributed). Second ISO/IEC CD 9945-2 registration (full draft distributed).

     - Equivalence classes as starting/ending points of regular expression bracket expression range expression have been made unspecified.
     - The LC_COLLATE substitute keyword has been deleted.
     - cksum (4.9): Modifications to the algorithm.
     - cp (4.13): Restoration of the
     - stty (4.59): Addition of the tostop operand.
     - lex (A.2): Further clarification of ERE differences.
     - Miscellaneous clarifications to various utilities.

Draft 11.1  [June 1991] Fifth IEEE ballot (fourth recirculation; only changed pages distributed).

     - Modification of the definition of byte and clarifications of octal/hexadecimal byte representations throughout the utilities.
     - Clarifications to the locale definition source file description in 2.5; addition of a yacc grammar.
     - Removal of pax -e character translation option.
     - Miscellaneous clarifications to various utilities.
     - Reconciliation of feature test macros and headers in Annex B with POSIX.1.

Draft 11  [February 1991] Fourth IEEE ballot (third recirculation).

     - Changes in 2.3 to the treatment of regular built-ins in regards to their exec-able versions.
     - Changes to 2.4 (character names and charmap syntax) and 2.5 (localedef input format) as a result of international balloting. Addition of the {POSIX2_LOCALEDEF} symbol.
     - Changes to the shell quoting rules, arithmetic expression syntax, command search order, error descriptions, and exportable functions.
     - Movement of the command utility from special built-in status to be a utility in Section 4.
     - cp (4.13): Significant clarifications and interface changes.
     - date (4.15): Added field descriptor modifiers to handle alternate calendar forms when supported by the locale and implementation.
     - pax (4.48): Significant interface changes, including international character set translations.
     - test (4.62): Deprecated some functionality due to inconsistent behavior in existing implementations that cause portability problems in existing applications.
     - make (6.2): Addition of the .POSIX special target, return of some rules to strict existing practice.
     - Miscellaneous clarifications to various utilities.
     - The FORTRAN section now has two options associated with it: Development Utilities (fort77) and Runtime Utilities (asa).
     - Addition of full example profiles and charmaps from Denmark in Annex F.

Draft 10  [July 1990] Third IEEE ballot (second recirculation).

     - This draft primarily has been one of clarification and amplification. In resolving ballot objections, large portions of the draft have been rewritten, affecting all sections, but comparatively few changes in [intended] functionality have occurred.
     - New shell command language features (see Section 3):
     - Utility name changes:

            Draft 9     Draft 10
            _______     ________
            create      pathchk
            hexdump     od
            sendto      mailx

     - A few of the utilities and global sections now have a more formal description, using a yacc-like grammar.
     - Considerably more detail has been added to the internationalization features of the standard: global changes to clauses 2.4 and 2.5; new detail to the LC_* variables in each utility section; specification of LC_MESSAGES (replacing LC_RESPONSE).
     - Due to some ISO requirements, Sections 1 and 2 have been reorganized yet again, causing many cross reference number changes. The Related Standards annex has been turned into simply a Bibliography. The Non-Specified Language Compilers annex has been replaced by a Sample National Profile annex.

Draft 9  [August 1989] Second IEEE ballot (first recirculation). Also registered as ISO/IEC CD 9945-2.1. A few minor corrections to some sections. :-)

Draft 8  [December 1988] First IEEE ballot. Also submitted to ISO/IEC JTC 1/SC22 for review and comment.

Draft 7  [September 1988] ``Mock ballot'' conducted by working group members only.

POSIX.2 Technical Reviewers

The individuals denoted in Table i are the Technical Reviewers for this draft. During balloting they are the subject matter experts who coordinate the resolution process for specific sections, as shown.
Table i - POSIX.2 Technical Reviewers
_______________________________________________________________________
Section    Description                                     Reviewer
_______________________________________________________________________
1          General                                         Jespersen
2.4,2.5    Definitions (Locales)                           Leijonhufvud
2 (rest)   Definitions (Various)                           Jespersen
3          Command Language                                Jespersen
4          Execution Environment Utilities: cp, rm         Bostic
4          Execution Environment Utilities: (the rest)     Jespersen
6          Software Development Utilities                  Jespersen
7          Language-Independent Bindings                   Jespersen
A          C Development Utilities                         Jespersen
B          C Bindings                                      Jespersen
C          FORTRAN Development and Runtime Utilities       Jespersen
D-G        Various                                         Jespersen
_______________________________________________________________________

Also, our special thanks to Donn Terry for writing or improving all the yacc-based grammars used in Draft 10.

POSIX.2 Proposed Schedule

This section will not appear in the final document. It is used to provide editorial notes regarding the proposed POSIX.2 schedule. In the schedule, the UPE stands for ``User Portability Extension.''

_____________________________________________________________________
Date                    Milestone (End of Meeting)              Draft
_____________________________________________________________________
Sep 7-11, 1987          Utility format frozen;                  3
Nashua, NH              10% of utilities described.

Dec 7-14, 1987          50% of utilities described;             4
San Diego, CA           shell update; substantial
                        progress in Sections 2, 3, 4, 8.

Mar 14-18, 1988         Utility selection frozen;               5
Washington, DC          75% described.

Jul 11-15, 1988         100% utilities described;               6
Denver, CO              functional freeze; produce ``mock
                        ballot'' and POSIX FIPS draft 7

[Sep-Oct 1988]          [Mock ballot]                           7

Oct 24-28, 1988         Resolve mock ballot objections;         7
Honolulu, HI            produce first real ballot (draft 8)
                        UPE planning begins

[Jan-Feb 1989]          [First ballot]                          8

Jan 9-11, 1989          Begin UPE definitions;                  8
Ft. Lauderdale, FL      Technical Reviewer coordination
                        of first ballot responses

[Feb-Apr 1989]          [Ballot resolution]                     8

Apr 24-28, 1989         Working Group concurrence with          9
Minneapolis, MN         ballot resolution; produce Draft 9
                        for recirculation; UPE work

Jul 10-14, 1989         UPE work
San Jose, CA

[Oct 1989]              [First Recirculation]                   9

[Nov-Feb 1990]          [Ballot resolution]                     9

[Aug-Sep 1990]          [Second Recirculation]                  10

[Mar 1991]              [Third Recirculation]                   11

[Jun 1991]              [Fourth Recirculation]                  11.1

[Sep 1991]              [Fifth Recirculation]                   11.2

[mid-1992]              [IEEE Standard Board Approves??]        12

[Jul 1990 - Apr 1992]   [Ballot .2a UPE supplement]
_____________________________________________________________________

END_RATIONALE
IEEE Standards documents are developed within the Technical Committees of the IEEE Societies and the Standards Coordinating Committees of the IEEE Standards Board. Members of the committees serve voluntarily and without compensation. They are not necessarily members of the Institute. The standards developed within IEEE represent a consensus of the broad expertise on the subject within the Institute as well as those activities outside of IEEE that have expressed an interest in participating in the development of the standard. Use of an IEEE Standard is wholly voluntary. The existence of an IEEE Standard does not imply that there are no other ways to produce, test, measure, purchase, market, or provide other goods and services related to the scope of the IEEE Standard. Furthermore, the viewpoint expressed at the time a standard is approved and issued is subject to change brought about through developments in the state of the art and comments received from users of the standard. Every IEEE Standard is subjected to review at least every five years for revision or reaffirmation. When a document is more than five years old and has not been reaffirmed, it is reasonable to conclude that its contents, although still of some value, do not wholly reflect the present state of the art. Users are cautioned to check to determine that they have the latest edition of any IEEE Standard. Comments for revision of IEEE Standards are welcome from any interested party, regardless of membership affiliation with IEEE. Suggestions for changes in documents should be in the form of a proposed change of text, together with appropriate supporting comments. Interpretations: Occasionally questions may arise regarding the meaning of portions of standards as they relate to specific applications. When the need for interpretations is brought to the attention of the IEEE, the Institute will initiate action to prepare appropriate responses. 
Since IEEE Standards represent a consensus of all concerned interests, it is important to ensure that any interpretation has also received the concurrence of a balance of interests. For this reason, the IEEE and the members of its technical committees are not able to provide an instant response to interpretation requests except in those cases where the matter has previously received formal consideration. Comments on standards and requests for interpretations should be addressed to: Secretary, IEEE Standards Board 445 Hoes Lane P.O. Box 1331 Piscataway, NJ 08855-1331 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. __________________________________________________________________ |IEEE Standards documents are adopted by the Institute of | |Electrical and Electronics Engineers without regard | |to whether their adoption may involve patents | |on articles, materials, or processes. | |Such adoption does not assume any liability to any patent owner, | |nor does it assume any obligation whatever to parties adopting | _||t_h_e__s_t_a_n_d_a_r_d_s__d_o_c_u_m_e_n_t_s_.__________________________________________|| Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. Contents PAGE Introduction....................................................... ii Organization of the Standard.................................... ii Base Documents.................................................. ii Related Standards Activities.................................... ii Section 1: General................................................. 1 1.1 Scope..................................................... 1 1.2 Normative References...................................... 13 1.3 Conformance............................................... 14 Section 2: Terminology and General Requirements.................... 21 2.1 Conventions............................................... 
21 2.2 Definitions............................................... 26 2.3 Built-in Utilities........................................ 58 2.4 Character Set............................................. 61 2.5 Locale.................................................... 69 2.6 Environment Variables..................................... 119 2.7 Required Files............................................ 126 2.8 Regular Expression Notation............................... 128 2.9 Dependencies on Other Standards........................... 161 2.10 Utility Conventions....................................... 172 2.11 Utility Description Defaults.............................. 182 2.12 File Format Notation...................................... 198 2.13 Configuration Values...................................... 204 Section 3: Shell Command Language.................................. 215 3.1 Shell Definitions......................................... 217 3.2 Quoting................................................... 220 3.3 Token Recognition......................................... 224 3.4 Reserved Words............................................ 226 3.5 Parameters and Variables.................................. 228 3.6 Word Expansions........................................... 233 3.7 Redirection............................................... 249 3.8 Exit Status and Errors.................................... 255 3.9 Shell Commands............................................ 258 3.10 Shell Grammar............................................. 279 3.11 Signals and Error Handling................................ 288 3.12 Shell Execution Environment............................... 289 3.13 Pattern Matching Notation................................. 291 3.14 Special Built-in Utilities................................ 295 Section 4: Execution Environment Utilities......................... 317 Copyright c 1991 IEEE. All rights reserved. 
This is an unapproved IEEE Standards Draft, subject to change. ii PAGE 4.1 awk - Pattern scanning and processing language............ 317 4.2 basename - Return nondirectory portion of pathname........ 358 4.3 bc - Arbitrary-precision arithmetic language.............. 362 4.4 cat - Concatenate and print files......................... 383 4.5 cd - Change working directory............................. 388 4.6 chgrp - Change file group ownership....................... 392 4.7 chmod - Change file modes................................. 395 4.8 chown - Change file ownership............................. 405 4.9 cksum - Write file checksums and sizes.................... 409 4.10 cmp - Compare two files................................... 416 4.11 comm - Select or reject lines common to two files......... 420 4.12 command - Execute a simple command........................ 424 4.13 cp - Copy files........................................... 430 4.14 cut - Cut out selected fields of each line of a file...... 440 4.15 date - Write the date and time............................ 445 4.16 dd - Convert and copy a file.............................. 452 4.17 diff - Compare two files.................................. 462 4.18 dirname - Return directory portion of pathname............ 471 4.19 echo - Write arguments to standard output................. 475 4.20 ed - Edit text............................................ 479 4.21 env - Set environment for command invocation.............. 498 4.22 expr - Evaluate arguments as an expression................ 503 4.23 false - Return false value................................ 509 4.24 find - Find files......................................... 511 4.25 fold - Fold lines......................................... 521 4.26 getconf - Get configuration values........................ 526 4.27 getopts - Parse utility options........................... 531 4.28 grep - File pattern searcher.............................. 
537 4.29 head - Copy the first part of files....................... 545 4.30 id - Return user identity................................. 549 4.31 join - Relational database operator....................... 554 4.32 kill - Terminate or signal processes...................... 559 4.33 ln - Link files........................................... 566 4.34 locale - Get locale-specific information.................. 570 4.35 localedef - Define locale environment..................... 577 4.36 logger - Log messages..................................... 583 4.37 logname - Return user's login name........................ 586 4.38 lp - Send files to a printer.............................. 589 4.39 ls - List directory contents.............................. 595 4.40 mailx - Process messages.................................. 605 4.41 mkdir - Make directories.................................. 610 4.42 mkfifo - Make FIFO special files.......................... 614 4.43 mv - Move files........................................... 617 4.44 nohup - Invoke a utility immune to hangups................ 623 4.45 od - Dump files in various formats........................ 627 4.46 paste - Merge corresponding or subsequent lines of files..................................................... 637 4.47 pathchk - Check pathnames................................. 642 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. iii PAGE 4.48 pax - Portable archive interchange........................ 648 4.49 pr - Print files.......................................... 665 4.50 printf - Write formatted output........................... 672 4.51 pwd - Return working directory name....................... 679 4.52 read - Read a line from standard input.................... 682 4.53 rm - Remove directory entries............................. 686 4.54 rmdir - Remove directories................................ 
692 4.55 sed - Stream editor....................................... 695 4.56 sh - Shell, the standard command language interpreter..... 706 4.57 sleep - Suspend execution for an interval................. 713 4.58 sort - Sort, merge, or sequence check text files.......... 716 4.59 stty - Set the options for a terminal..................... 725 4.60 tail - Copy the last part of a file....................... 736 4.61 tee - Duplicate standard input............................ 742 4.62 test - Evaluate expression................................ 745 4.63 touch - Change file access and modification times......... 756 4.64 tr - Translate characters................................. 762 4.65 true - Return true value.................................. 770 4.66 tty - Return user's terminal name......................... 772 4.67 umask - Get or set the file mode creation mask............ 775 4.68 uname - Return system name................................ 780 4.69 uniq - Report or filter out repeated lines in a file...... 784 4.70 wait - Await process completion........................... 790 4.71 wc - Word, line, and byte count........................... 795 4.72 xargs - Construct argument list(s) and invoke utility..... 799 Section 5: User Portability Utilities Option....................... 807 Section 6: Software Development Utilities Option................... 809 6.1 ar - Create and maintain library archives................. 809 6.2 make - Maintain, update, and regenerate groups of programs.................................................. 818 6.3 strip - Remove unnecessary information from executable files..................................................... 844 Section 7: Language-Independent System Services.................... 847 7.1 Shell Command Interface................................... 848 7.2 Access Environment Variables.............................. 849 7.3 Regular Expression Matching............................... 
849
    7.4 Pattern Matching .... 850
    7.5 Command Option Parsing .... 850
    7.6 Generate Pathnames Matching a Pattern .... 850
    7.7 Perform Word Expansions .... 851
    7.8 Get POSIX Configurable Variables .... 851
    7.9 Locale Control .... 852
Annex A (normative) C Language Development Utilities Option .... 855
    A.1 c89 - Compile Standard C programs .... 856
    A.2 lex - Generate programs for lexical tasks .... 867
    A.3 yacc - Yet another compiler compiler .... 884
Annex B (normative) C Language Bindings Option .... 907
    B.1 C Language Definitions .... 908
        B.1.1 POSIX Symbols .... 908
        B.1.2 Headers and Function Prototypes .... 910
        B.1.3 Error Numbers .... 911
    B.2 C Numerical Limits .... 911
        B.2.1 C Macros for Symbolic Limits .... 912
        B.2.2 Compile-Time Symbolic Constants for Portability Specifications .... 913
        B.2.3 Execution-Time Symbolic Constants for Portability Specifications .... 914
        B.2.4 POSIX.1 C Numerical Limits .... 915
    B.3 C Binding for Shell Command Interface .... 915
        B.3.1 C Binding for Execute Command .... 916
        B.3.2 C Binding for Pipe Communications with Programs .... 919
    B.4 C Binding for Access Environment Variables .... 925
    B.5 C Binding for Regular Expression Matching .... 925
    B.6 C Binding for Match Filename or Pathname .... 934
    B.7 C Binding for Command Option Parsing ....
937
    B.8 C Binding for Generate Pathnames Matching a Pattern .... 942
    B.9 C Binding for Perform Word Expansions .... 948
    B.10 C Binding for Get POSIX Configurable Variables .... 954
    B.11 C Binding for Locale Control .... 957
Annex C (normative) FORTRAN Development and Runtime Utilities Options .... 959
    C.1 asa - Interpret carriage-control characters .... 960
    C.2 fort77 - FORTRAN compiler .... 964
Annex D (informative) Bibliography .... 973
Annex E (informative) Rationale and Notes .... 977
    E.1 General .... 977
    E.2 Terminology and General Requirements .... 978
    E.3 Shell Command Language .... 979
    E.4 Execution Environment Utilities .... 980
    E.5 User Portability Utilities Option .... 993
    E.6 Software Development Utilities Option .... 993
    E.7 Language-Independent System Services .... 994
    E.8 C Language Development Utilities Option .... 994
    E.9 C Language Bindings Option .... 995
    E.10 FORTRAN Development and Runtime Utilities Options .... 996
Annex F (informative) Sample National Profile .... 997
Annex G (informative) Balloting Instructions .... 1091
Identifier Index .... 1105
Alphabetic Topical Index .... 1111

FIGURES

Figure B-1 - Sample system() Implementation .... 922
Figure B-2 - Sample pclose() Implementation ....
926
Figure B-3 - Example Regular Expression Matching .... 933
Figure B-4 - Argument Processing with getopt() .... 942

TABLES

Table 2-1 - Typographical Conventions .... 22
Table 2-2 - Regular Built-in Utilities .... 58
Table 2-3 - Character Set and Symbolic Names .... 62
Table 2-4 - Control Character Set .... 63
Table 2-5 - LC_CTYPE Category Definition in the POSIX Locale .... 76
Table 2-6 - Valid Character Class Combinations .... 81
Table 2-7 - LC_COLLATE Category Definition in the POSIX Locale .... 84
Table 2-8 - LC_MONETARY Category Definition in the POSIX Locale .... 96
Table 2-9 - LC_NUMERIC Category Definition in the POSIX Locale .... 101
Table 2-10 - LC_TIME Category Definition in the POSIX Locale .... 102
Table 2-11 - LC_MESSAGES Category Definition in the POSIX Locale .... 106
Table 2-12 - BRE Precedence .... 136
Table 2-13 - ERE Precedence .... 139
Table 2-14 - C Standard Operators and Functions .... 171
Table 2-15 - Escape Sequences .... 199
Table 2-16 - Utility Limit Minimum Values .... 205
Table 2-17 - Symbolic Utility Limits .... 206
Table 2-18 - Optional Facility Configuration Values .... 212
Table 4-1 - awk Expressions in Decreasing Precedence .... 322
Table 4-2 - awk Escape Sequences .... 347
Table 4-3 - bc Operators .... 370
Table 4-4 - ASCII to EBCDIC Conversion .... 459
Table 4-5 - ASCII to IBM EBCDIC Conversion .... 460
Table 4-6 - dirname Examples .... 474
Table 4-7 - expr Expressions ....
505
Table 4-8 - od Named Characters .... 632
Table 4-9 - stty Control Character Names .... 730
Table 4-10 - stty Circumflex Control Characters .... 731
Table 7-1 - POSIX.1 Numeric-Valued Configurable Variables .... 853
Table A-1 - lex Table Size Declarations .... 873
Table A-2 - lex Escape Sequences .... 875
Table A-3 - lex ERE Precedence .... 877
Table A-4 - yacc Internal Limits .... 903
Table B-1 - POSIX.2 Reserved Header Symbols .... 911
Table B-2 - _POSIX_C_SOURCE .... 911
Table B-3 - C Macros for Symbolic Limits .... 914
Table B-4 - C Compile-Time Symbolic Constants .... 916
Table B-5 - C Execution-Time Symbolic Constants .... 916
Table B-6 - Structure Type regex_t .... 928
Table B-7 - Structure Type regmatch_t .... 928
Table B-8 - regcomp() cflags Argument .... 928
Table B-9 - regexec() eflags Argument .... 928
Table B-10 - regcomp(), regexec() Return Values .... 932
Table B-11 - fnmatch() flags Argument .... 937
Table B-12 - Structure Type glob_t .... 944
Table B-13 - glob() flags Argument ....
945
Table B-14 - glob() Error Return Values .... 947
Table B-15 - Structure Type wordexp_t .... 950
Table B-16 - wordexp() flags Argument .... 951
Table B-17 - wordexp() Return Values .... 952
Table B-18 - confstr() name Values .... 955
Table B-19 - C Bindings for Numeric-Valued Configurable Variables .... 958

Introduction

(This Introduction is not a normative part of P1003.2 Information technology -- Portable Operating System Interface (POSIX) -- Part 2: Shell and Utilities, but is included for information only.)

The purpose of this standard is to define a standard interface and environment for application programs that require the services of a ``shell'' command language interpreter and a set of common utility programs. It is intended for systems implementors and application software developers, and is complementary to ISO/IEC 9945-1: 1990 {8} (first in a family of ``POSIX'' standards), which specifies operating system interfaces and source code level functions, based on the UNIX 1) system documentation. This standard, or ``POSIX.2,'' is based upon documentation and the knowledge of existing programs that assume an interface and architecture similar to that described by POSIX.1. (See 1.1 for a full description of the relationship between the standards.)

The majority of this standard describes the functions of utilities that can interface with application programs. The standard also provides high-level language interfaces that the application uses to access these utilities and other useful, related services.
These language-independent service interfaces are temporarily described in terms of their C language bindings. The C language assumed is that defined by the C Standard: ANSI/X3.159-1989 Programming Language C Standard, produced by Technical Committee X3J11 of the Accredited Standards Committee X3 -- Information Processing Systems.

Organization of the Standard

The standard is divided into ten parts:

 - General, including a statement of scope, normative references, and conformance requirements (Section 1).
 - Definitions, general requirements, and the environment available to applications (Section 2).
 - The shell command interpreter language (Section 3).
 - Descriptions of the utilities in the required ``Execution Environment Utilities'' (Section 4).
 - Descriptions of the utilities required for user portability on asynchronous terminals (Section 5 [to be provided in a future revision]).
 - Descriptions of the utilities in the optional ``Software Development Utilities'' (Section 6).
 - Language-independent interfaces for high-level programming language access to shell and related services (Section 7).
 - Descriptions of the utilities in the optional ``C Language Development Utilities'' (Normative Annex A).
 - C language bindings to the interfaces in Section 7 (Normative Annex B).
 - Descriptions of the utilities in the optional ``FORTRAN Development and Runtime Utilities'' (Normative Annex C).

This introduction, the foreword, any footnotes, NOTES accompanying the text, and the informative annexes are not considered part of the standard. Annexes D through G are informative.

__________
1) UNIX is a registered trademark of UNIX System Laboratories in the USA and other countries.

Base Documents

Many of the interfaces and utilities of this standard were adapted from materials in machine-readable forms donated by the following organizations:

 - AT&T: the System V Interface Definition (SVID) {B24},2) Issue 2, Volume 2. Copyright (c) 1986, AT&T; reprinted with permission.
 - The X/Open Company, Ltd.: the X/Open Portability Guide {B30} {B31}, Issues II and III, Volume 1. Copyright (c) 1989, X/Open Company, Ltd; reprinted with permission.
 - University of California, The UNIX User's Reference Manual {B28}, 4.3 Berkeley Software Distribution, Virtual VAX-11 Version, 1986. Copyright (c) 1980, 1983, The Regents of the University of California; reprinted with permission.3)

Significant reference use was also made of the following books:

 - Bolsky, Morris I., Korn, David G., The KornShell Command and Programming Language {B25}, Prentice Hall, Englewood Cliffs, New Jersey (1988).
 - Aho, Alfred V., Kernighan, Brian W., Weinberger, Peter J., The AWK Programming Language {B21}, Addison-Wesley, Reading, Massachusetts (1988).

Many other proposals for functions and utilities were received from the various working group members, who are listed in the Acknowledgements section of this standard.

__________
2) The number in braces corresponds to those of the references in 1.2 (or the bibliographic entry in Annex D if the number is preceded by the letter B).

Related Standards Activities

Activities to extend this standard to address additional requirements are in progress, and similar efforts can be anticipated in the future.
The following areas are under active consideration at this time, or are expected to become active in the near future:4)

 (1) Language-independent service descriptions of POSIX.1 {8}
 (2) C, Ada, and FORTRAN Language bindings to (1)
 (3) Verification testing methods
 (4) Realtime facilities
 (5) Secure/Trusted System considerations
 (6) Network interface facilities
 (7) System Administration
 (8) Graphical User Interfaces
 (9) Profiles describing application- or user-specific combinations of Open Systems standards for: supercomputing, multiprocessor, and batch extensions; transaction processing; realtime systems; and multiuser systems based on historical models
 (10) An overall guide to POSIX-based or related Open Systems standards and profiles

__________
3) The IEEE is grateful to AT&T, UniForum, and the Regents of the University of California for permission to use their machine-readable materials.
4) A Standards Status Report that lists all current IEEE Computer Society standards projects is available from the IEEE Computer Society, 1730 Massachusetts Avenue NW, Washington, DC 20036-1903; Telephone: +1 202 371-0101; FAX: +1 202 728-9614. Working drafts of POSIX standards under development are also available from this office.

Extensions are approved as ``amendments'' or ``revisions'' to this document, following the IEEE and ISO/IEC Procedures. Approved amendments are published separately until the full document is reprinted and such amendments are incorporated in their proper positions.

If you have interest in participating in the TCOS working groups addressing these issues, please send your name, address, and phone number to the Secretary, IEEE Standards Board, Institute of Electrical and Electronics Engineers, Inc., P.O.
Box 1331, 445 Hoes Lane, Piscataway, NJ 08855-1331, and ask to have this forwarded to the chairperson of the appropriate TCOS working group. If you have interest in participating in this work at the international level, contact your ISO/IEC national body.

P1003.2 was prepared by the 1003.2 working group, sponsored by the Technical Committee on Operating Systems and Application Environments of the IEEE Computer Society. At the time this standard was approved, the membership of the 1003.2 working group was as follows:

Technical Committee on Operating Systems and Application Environments (TCOS)

    Chair: Jehan-François Pâris

TCOS Standards Subcommittee

    Chair: Jim Isaak
    Vice Chairs: Ralph Barker, David Dodge, Robert Bismuth, Hal Jespersen, Lorraine Kevra
    Treasurer: Quin Hahn
    Secretary: Shane McCarron

1003.2 Working Group Officials

    Chair: Hal Jespersen
    Vice Chair: Donald W. Cragun
    Editors: Hal Jespersen (1986, 1988-1991), Maggie Lee (1987-1988)
    Secretaries: Helene Armitage (1988-1990), Dave Grindeland (1991), Robert J. Makowski (1987-1988)

Technical Reviewers

Helene Armitage         Ken Faubel              Gary Miller
Keith Bostic            Greger Leijonhufvud     Marc Teitelbaum
John Caywood            Bob Lenk                Donn Terry
Donald Cragun           Mark Levine             Teoman Topcubasi
David Decot             Shane McCarron          David Willcox

Working Group

Helene Armitage         Quin Hahn               Jim Oldroyd
Brian Baird             Michael J. Hannah       Mark Parenti
John R. Barr            Marjorie E. Harris      John Peace
Philippe Bertrand       David F. Hinnant        Jon Penner
Robert Bismuth          Leon M. Holmes          Gerald Powell
Jim Blondeau            Ron Holt                John Quarterman
James C. Bohem          Randall Howard          Joe Ramus
Kathy Bohrer            Steven A. James         Mike Ressler
Keith Bostic            Steve Jennings          Grover Righter
Phyllis Eve Bregman     Hal Jespersen           Andrew K. Roach
Peter Brouwer           Ronald S.
                        Karr                    Marco P. Roodzant
F. Lee Brown, Jr.       Lorraine C. Kevra       Seth Rosenthal
Jonathan Brown          Martin Kirk             Maude Sawyer
James A. Capps          Brad Kline              Norman K. Scherer
Bill Carpenter          Hiromichi Kogure        Glen Seeds
Steve Carter            David Korn              Jim Selkaitis
John Caywood            Rick Kuhn               Karen Sheaffer
Bob Claeson             Mike Lambert            Del Shoemaker
Mark Colburn            Maggie Lee              James Soddy
Donald W. Cragun        Perry Lee               Daniel Steinberg
Dave Decot              Greger Leijonhufvud     Scott A. Sutter
Terence S. Dowling      Bob Lenk                Ravi Tavakley
Stephen Dum             Mark Levine             Marc Teitelbaum
Dominic Dunlop          Gary Lindgren           Donn Terry
Mike Edmonds            John Lomas              Jack Thompson
Ron Elliott             Craig Lund              Teoman Topcubasi
Richard W. Elwood       Rod MacDonald           Eugene Tsuno
Hirsaki Eto             Dan Magenheimer         Geraldine Vitovitch
Fran Fadden             Robert J. Makowski      Carl vonLoewenfeldt
Ken Faubel              Shane P. McCarron       Mike Wallace
Martin C. Fong          Jim McGinness           Alan Weaver
Terance Fong            John McGrory            Larry Wehr
Glenn Fowler            Stuart McKaig           Bruce Weiner
Gary A. Gaudet          Sunil Mehta             N. Ray Wilkes
Al Gettier              Bill Middlecamp         David Willcox
Timothy D. Gill         Gary W. Miller          Neil Winton
Gregory Goddard         Jim Moe                 David Woodend
Loretta Goudie          Yasushi Nakahara        Morten With
Dave Grindeland         Martha Nalebuff         Ken Witte
John Lawrence Gregg     Sonya D. Neufer         John Wu
Jerry Gross             Landon Noll             Peggy Younger
Douglas A. Gwyn         Robin T. O'Neill        Hilary Zaloom

The following persons were members of the 1003.2 Balloting Group that approved the standard for submission to the IEEE Standards Board:

Derek Kaufman           X/Open Institutional Representative
Shane McCarron          UNIX International Institutional Representative
Peter Collinson         USENIX Association Institutional Representative

Scott Anderson          Carol J. Harkness       Jim R.
                                                Oldroyd
Helene Armitage         Craig Harmer            Craig Partridge
David Athersych         Dale Harris             Rob Peglar
Geoff Baldwin           Myron Hecht             John C. Penney
Jerome E. Banasik       Morris J. Herbert       Rand S. Phares
Steven E. Barber        David F. Hinnant        P. J. Plauger
Robert M. Barned        Lee A. Hollaar          Gerald Powell
David R. Bernstein      Ronald Holt Jr.         Scott E. Preece
Kabekode V. S. Bhat     Randall Howard          James M. Purtilo
Robert Bismuth          Jim Isaak               J. S. Quarterman
Jim Blondran            Richard James           Wendy Rauch-Hindin
Robert Borochoff        Hal Jespersen           Brad Rhoades
Keith Bostic            Greg Jones              Christopher J. Riddick
James P. Bound          Michael J. Karels       Andrew K. Roach
Joseph Boykin           Lorraine C. Kevra       Arnold Robbins
Kevin Brady             Alan W. Kiecker         R. Hughes Rowlands
Phyllis Eve Bregman     Jeff Kimmel             Robert Sarr
A. Winsor Brown         M. J. Kirk              Norman Schneidewind
F. Lee Brown Jr.        Kenneth C. Klingman     Wolfgang Schwabl
Luis-Felipe Cabrera     Joshua W. Knight        Richard Scott
Nicholas A. Camillone   David Korn              Glen Seeds
Andres Caravallo        Takahiko Kuki           Dan Shia
Steven L. Carter        Robin B. Lake           Roger Shimada
John Caywood            Mike Lambert            Mukesh Singhal
Kilnam Chon             Doris Lebovits          Richard Sniderman
Chan F. Chong           Maggie Lee              Steven Sommars
Robert L. Claeson       Greger Leijonhufvud     Bryan W. Sparks
Mark Colburn            Robert M. Lenk          Richard Stallman
Kenneth N. Cole         David Lennert           Daniel Steinberg
Richard Cornelius       Mark E. Levine          Douglas H. Steves
William M. Corwin       Kevin Lewis             Peter Sugar
Mike R. Cossey          Kin F. Li               Scott A. Sutter
William Cox             James P. Lonjers        Ravi Tavakley
Donald W. Cragun        Joseph F. P. Luhukay    Donn Terry
Terence Dowling         Paul Lustgarten         Gary F. Tom
Stephen A. Dum          Ron Mabe                A. T. Twigger
John D. Earls           Robert J. Makowski      Mark-Rene Uchida
Ron Elliott             Roger J. Martin         L. David Umbaugh
Richard W. Elwood       Joberto S. B. Martins   Michael W. Vannier
David Emery             Yoshihiro Matsumoto     M. B. Wagner
Philip H. Enslow        Shane McCarron          John W. Walz
Ken Faubel              Martin J.
                        McGowan III             Alan G. Weaver
Terence Fong            Marshall Kirk McKusick  Larry Wehr
Ed Frankenberry         Robert W. McWhirter     Bruce Weiner
John A. Gertwagen       Doug Michels            Brian Weis
Al Gettier              Gary W. Miller          Peter J. Weyman
Michel Gien             James M. Moe            Andrew E. Wheeler
Gregory W. Goddard      J. W. Moore             David Willcox
Robert C. Groman        Anita Mundkur           Jeff Wubik
Judy Guist              Martha Nalebuff         Oren Yuen
Gregory Guthrie         Fred Noz                Jason Zions
Michael J. Hannah       Alan F. Nugent

When the IEEE Standards Board approved this standard on <date to be provided>, it had the following membership:

(to be pasted in by IEEE)

END_RATIONALE

P1003.2/D11.2

Information technology -- Portable Operating System Interface (POSIX) -- Part 2: Shell and Utilities

Section 1: General

1.1 Scope

This standard defines a standard source code level interface to command interpretation, or ``shell,'' services and common utility programs for application programs. These services and programs are complementary to those specified by ISO/IEC 9945-1: 1990 {8}, hereinafter referred to as ``POSIX.1 {8}.''

The standard has been designed to be used by both application programmers and system implementors. However, it is intended to be a reference document and not a tutorial on the use of the services, the utilities, or the interrelationships between the utilities.

The emphasis of this standard is on the shell and utility functionality required by application programs (including ``shell scripts'') and not on the direct interactive use of the shell command language or the utilities by humans.

Portions of this standard comprise optional language bindings to system service interfaces. See, for example, the C Language Bindings Option in Annex B.
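
As an informal illustration of the kind of language binding referred to above, the following C sketch invokes the command interpreter through system() and popen(), the functions that Annex B of this draft binds to the Shell Command Interface (B.3.1 and B.3.2). It is not part of the standard. The helper names run_command() and first_output_line() are hypothetical, and the headers and the _POSIX_C_SOURCE value are those of a present-day POSIX system, assumed here only so that the example compiles.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: a modern POSIX environment */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>

/* Hypothetical helper: run `command` through the shell with system().
 * Returns the command's exit status, or -1 if the shell could not be
 * run or the command did not exit normally. */
int run_command(const char *command)
{
    int status;

    assert(command != NULL);
    status = system(command);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical helper: run `command` through the shell with popen() and
 * copy the first line of its standard output, without the trailing
 * newline, into buf. Returns 0 on success and -1 on failure. */
int first_output_line(const char *command, char *buf, size_t size)
{
    FILE *pipe;
    int status;

    assert(command != NULL && buf != NULL && size > 0);
    pipe = popen(command, "r");
    if (pipe == NULL)
        return -1;
    buf[0] = '\0';
    if (fgets(buf, (int)size, pipe) != NULL)
        buf[strcspn(buf, "\n")] = '\0';   /* strip the newline, if any */
    status = pclose(pipe);
    return (status == -1) ? -1 : 0;
}
```

A caller would typically test the value returned by run_command() against zero, and would use first_output_line() when only a single line of a utility's output is of interest. Both helpers leave the interpretation of the command string entirely to the shell, which is the point of the interface.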

This standard is intended to describe language interfaces and utilities in sufficient detail so that an application developer can understand the required interfaces without access to the source code of existing implementations on which they may be based. Therefore, it does not attempt to describe the source programming language or internal design of the utilities; they should be considered ``black boxes'' that exhibit the described functionality.

For language interfaces, or functions, this standard has been defined exclusively at the source code level. The objective is that a conforming portable application source program can be translated to execute on a conforming implementation. The standard assumes that the source program may need to be retranslated to produce target code for a new environment prior to execution in that environment.

There is no requirement that the base operating system supporting the shell and utilities be one that fully conforms to ISO/IEC 9945-1: 1990 {8}. (The base system could contain a subset of POSIX.1 {8} functionality, enough to support the requirements for this standard, as described in 2.9.1, but that could not claim full conformance to all of POSIX.1 {8}.) Furthermore, there is no requirement that the shell command interpreter or any of the standard utilities be written as POSIX.1 {8} conforming programs, or be written in any particular language.

Although not requiring a fully conforming POSIX.1 {8} base, this standard is based upon documentation and the knowledge of existing programs that assume an interface and architecture similar to that described by POSIX.1 {8}. Any questions regarding the definition of terms or the semantics of an underlying concept should be referred to POSIX.1 {8}.

BEGIN_RATIONALE

1.1.1 Scope Rationale.
(This subclause is not a part of P1003.2)

This standard is one of a family of related standards. The term POSIX is correctly used to describe this family, and not only its foundation, the operating system interfaces of POSIX.1 {8}. Therefore, POSIX.2 could colloquially be described as the ``POSIX Shell and Tools Standard.''

The interfaces documented for this standard are to and from high-level language application programs and to and from the utilities themselves; the standard does not directly address the interface with users.

The ``source code'' interface to the command interpreter is defined in terms of high-level language functions in 7.1.1 or 7.1.2 (such as system(), B.3.1, or popen(), B.3.2). There are also other function interfaces, such as those for matching regular expressions in 7.3 (regcomp() in B.5). Many of the utilities in this standard, and the shell itself, also accept their own command languages or complex directives as input data, which is also referred to as source code. This data, an ordered series of characters, may be stored in files, or ``scripts,'' that are portable between systems without true recompilation. However, just as with POSIX.1 {8}, the standard addresses only the issue of source code portability between systems; applications using these calls may have to be recompiled or translated when moving from one system to another.

There has been considerable debate concerning the appropriate scope of the work represented by this standard. The following are rational alternatives that have been evaluated:

 (1) Define the shell and tools as extensions to POSIX.1 {8}. This would require a fully conforming POSIX.1 {8} system as a base for the new facilities described here.
     Vocal proponents for this view have been the members of the POSIX.3 working group, who foresaw difficulties in producing a verification suite standard without having a known operating system base.

 (2) Decouple the shell and tools entirely from POSIX.1 {8}. This would potentially allow the standard to be implemented on such popular operating systems as MVS/TSO, VM/CMS, MS/DOS, VMS, etc. Those systems would not have to provide every minor detail of the POSIX.1 {8} language interfaces to conform under this model--only enough to support the shell and tools.

 (3) Compromise between options 1 and 2. Base the standard on an interface similar to POSIX.1 {8}, but don't require full conformance. A simple example would be a Version 7 UNIX System, which could not conform to POSIX.1 {8} without considerable modification. However, a vendor could support all of the features of this standard without changing its kernel or binary compatibility. Another example would be a system that conformed to all stated POSIX.1 {8} interfaces, but that didn't have a fully conforming C Standard {7} compiler. The difficulty with this option is that it makes the stated goal of the working group a bit fuzzier and increases the amount of analysis required for the features included.

The working group selected option 3 as its goal. It chose to retain the full UNIX system-like orientation, but did not wish to arbitrarily deprive legitimate systems that could almost conform. No useful features of shells or commonly used utilities were discarded to accommodate nonconforming base systems; on the other hand, no deliberate obstacles were arbitrarily erected.

Furthermore, POSIX.1 {8} is still required for its definitions and architectural concepts, which are purposely not repeated in this standard. One concrete example of how the two standards interrelate is in the usage of POSIX.1 {8} function names in the descriptions of utilities in POSIX.2.
There are a number of historical commands that directly mapped into one of the UNIX system calls. For example: chmod and chmod(); ln and link(). The POSIX.2 working group was faced with the problem of having to define all of the complex interactions ``behind the scenes'' for some simple commands. Creating a file, for example, involves many POSIX.1 {8} concepts, including processes, user IDs, multiple group permissions (which are optional), error conditions, etc. Rather than enumerating all of these interactions in many places, the POSIX.2 group chose to employ the POSIX.1 {8} function descriptions, where appropriate. See the chmod utility in 4.7 as an example. The utility description includes the phrase:

    ... performing actions equivalent to the chmod() function as defined in the POSIX.1 {8} chmod() function:

This means that the POSIX.2 implementor has to read the POSIX.1 {8} chmod() description and fully understand all of its functionality, requirements, and side effects, which now don't have to be repeated here. (Admittedly, this makes the POSIX.2 standard a bit more difficult to read, but the working group felt that precision transcended the need for readable or semi-tutorial documents.)

The Introduction states that one of the goals of the working group was: ``This interface should be implementable on conforming POSIX.1 {8} systems.'' This implies that the working group has attempted to ensure that no additional functionality or extension is required to implement this standard on the base defined by POSIX.1 {8}. This is not to say that extensions are not allowed, but that they should not be necessary.
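
The correspondence noted above between utilities and POSIX.1 functions (chmod and chmod(); ln and link()) can be illustrated with a short C sketch. It shows only the single function call that each utility is described in terms of; it does not attempt the option handling, mode parsing, or diagnostics that the utility descriptions require. The helper names set_owner_rw_others_r() and hard_link() are hypothetical and are introduced purely for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: roughly what `chmod u=rw,go=r path` asks for,
 * expressed as one call to the POSIX.1 chmod() function.
 * Returns 0 on success and -1 on failure (errno is set by chmod()). */
int set_owner_rw_others_r(const char *path)
{
    assert(path != NULL);
    return chmod(path, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
}

/* Hypothetical helper: roughly what `ln existing new_name` asks for,
 * expressed as one call to the POSIX.1 link() function.
 * Returns 0 on success and -1 on failure (errno is set by link()). */
int hard_link(const char *existing, const char *new_name)
{
    assert(existing != NULL && new_name != NULL);
    return link(existing, new_name);
}
```

All of the behavior concerning permissions, ownership, and error conditions comes from the underlying functions, which is why the utility descriptions can defer to POSIX.1 {8} instead of restating it.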

The goal ``(7) Utilities and standards for the installation of applications'' was once interpreted to mean that an elaborate series of tools was required to install and remove applications, based on complex description files and system databases of capabilities. An attempt to provide this was rejected by the balloting group and that type of system is now being evaluated by the POSIX.7 System Administration group. However, the original goal remains in the list, because many of the standard utilities are, in fact, targeted specifically for application installation--make, c89, lex, etc.

1.1.1.1 Existing Practice.

(This subclause is not a part of P1003.2)

The working group would have been very happy to develop a standard that allowed all historical implementations (i.e., those existing prior to the time of publication) to be fully conforming and all historical applications to be Strictly Conforming POSIX Shell Applications without requiring any changes. However, some modifications will be required to reconcile the specific differences between historical implementations; there are many divergent versions of UNIX systems extant and applications have sometimes been written to take advantage of features (or bugs) on specific systems. Therefore, the working group established a set of goals to maximize the value of the standard it eventually produced. These goals are enumerated in the following subclauses. They are listed in approximate priority sequence, where the first subclause is the most important portability goal.

1.1.1.1.1 Preserve Historical Applications

The most important priority was to ensure that historical applications continued to operate on conforming implementations. This required the selection of many utilities and features from the most prevalent historical implementations.
The working group is relying on the following factors:

(1) Many inconsistent historical features will still be supported as obsolescent.

(2) Common features of System V and BSD will continue to be supported by their sponsors, even if they aren't included here (just as long as they are not prevented from existing).

Therefore, the standard was written so that the large majority of well-written historical applications should continue to operate as Conforming POSIX Shell Applications Using Extensions.

1.1.1.1.2 Clean Up the Interfaces

The working group chose to extend the benefits of historical UNIX systems by making limited improvements to the utility interfaces; numerous complaints have been heard over the years about the inconsistencies in the command line interface, which have allegedly made it harder for novice users. Given the constraints of Preserve Historical Applications, the working group has made the following general modifications:

(1) Utilities have been extended to deal with differences in character sets, collating sequences, and some cultural aspects relating to the locale of the user. (Examples: new features in regular expressions; new formatting options in date; see 4.15.)

(2) The utility syntax guidelines in 2.10.2 have been applied to almost all of the utilities to promote a consistent interface. The guidelines themselves have been loosened up a bit from their counterparts in the SVID. In many cases historical utilities have not conformed with these guidelines (which were written considerably later than the utilities themselves). The older interfaces have been maintained in the standard as obsolescent features. (Examples: join, sort.) However, in some cases, such as dd and find, such major surgery was required that the working group decided to leave the historical interfaces as is. ``Fixing'' the interface would mean replacing the command, which would not help applications portability. So, fixing was limited to relatively minor abuses of the new guidelines, where reasonable consistency could be achieved while still maintaining the general type of interface of the historical version.

(3) Features that were not generally portable across machine architectures or systems have been removed or marked obsolescent and new, more portable interfaces have been introduced. (Examples: the octal number methods of describing file modes in chmod and other utilities have been marked obsolescent; the symbolic ``ugo'' method has been extended to other utilities, such as umask.)

(4) Features that have proved to be popular in some specific UNIX system variants have been adopted. (Examples: diff -c, which originated in BSD systems, and the ``new'' awk, from System V.) Such features were selected given the requirements for balloting group consensus; the features had to be used widely enough to balance accusations of ``creeping featurism'' and violations of the UNIX system ``tools philosophy.''

(5) Unreasonable inconsistencies between otherwise similar interfaces have been reconciled. (Example: methods of specifying the patterns to the three grep-related utilities have been made more consistent in the standard's single grep.)

(6) When irreconcilable differences arose between versions of historical utilities, new interfaces (utility names or syntax) were sometimes added in their places. The working group resisted the urge to deviate significantly from historical practice; the new interfaces are generally consistent with the philosophy of historical systems and represent comparable functionality to the interfaces being replaced. In some cases, System V and BSD had diverged (such as with echo and sum) so significantly that no compromises for a common interface were possible.
In these cases, either the divergent features were omitted or an entirely new command name was selected (such as with printf and cksum).

(7) Arbitrary limits to utility operations have been removed. (Example: some historical ed utilities have very limited capabilities for dealing with large files or long input lines.)

(8) Arbitrary limitations on historical extensions have been eliminated. (Example: regular expressions have been described so that the popular \< ... \> extension is allowed.)

(9) Input and output formats have been specified in more detail than historical implementations have required, allowing applications to more effectively operate in pipelines with these utilities. (Example: comm.)

Thus, in many cases the working group could be accused of ``violating Existing Practice,'' and in fact received some balloting objections to that effect from implementors (although rarely from users or application developers). The working group was sensitive to charges that it was engaged in arbitrary software engineering rather than merely codifying existing practice. When changes were made, they were always written to preserve historical applications, but to move new conforming applications into a more consistent, portable environment. This strategy obviously requires changes to historical implementations; the working group carefully evaluated each change, weighing the value to users against the one-time costs of adding the new interfaces (and of possibly breaking applications that took advantage of bugs), generally siding with the users when the costs to implementations and applications were not excessively high.
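Several of the modifications listed above (items 3, 5, 6, and 9) can be illustrated with a short script. This is a sketch under stated assumptions rather than text from the draft: the scratch directory, file names, and sample data are hypothetical, and the -E and -F options of grep are taken from later published editions of the standard and assumed to match this draft.

```sh
#!/bin/sh
# Illustrative sketch only; file names and sample data are hypothetical.
work=${TMPDIR:-/tmp}/p1003_2_clean.$$
mkdir "$work" || exit 1

# (3) Symbolic file modes in place of the obsolescent octal form.
: > "$work/octal"
: > "$work/symbolic"
chmod 644 "$work/octal"
chmod u=rw,go=r "$work/symbolic"
octal_mode=$(ls -l "$work/octal" | cut -c1-10)
symbolic_mode=$(ls -l "$work/symbolic" | cut -c1-10)

# (5) A single grep selects the pattern type with an option
#     (assumed options: -F fixed strings, -E extended regular expressions).
printf '%s\n' 'a.c' 'abc' 'xyz' > "$work/names"
fixed_hits=$(grep -F -c 'a.c' "$work/names")
ere_hits=$(grep -E -c 'a.c|xyz' "$work/names")

# (6) printf gives explicit control over the trailing <newline>.
prompt_bytes=$(( $(printf '%s' 'name? ' | wc -c) ))

# (9) comm output has a defined format, so it can feed another filter.
printf '%s\n' a b c > "$work/one"
printf '%s\n' b c d > "$work/two"
common_count=$(( $(comm -12 "$work/one" "$work/two" | wc -l) ))

rm -rf "$work"
printf 'modes: %s %s\n' "$octal_mode" "$symbolic_mode"
printf 'grep -F: %s, grep -E: %s\n' "$fixed_hits" "$ere_hits"
printf 'prompt bytes: %s, common lines: %s\n' "$prompt_bytes" "$common_count"
```

The fixed-string search matches only the literal text a.c, while the extended regular expression treats the period as a wildcard and accepts an alternative, which is the kind of difference that formerly required choosing among three separate commands.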
In some cases, changes were reluctantly made that could conceivably break some historical applications; the working group allowed these only in the face of practices it considered rare or significantly misguided.

1.1.1.1.3 Allow Historical Conforming Applications

It is likely that many historical shell scripts will be Strictly Conforming POSIX.2 Applications without requiring modifications. Developers have long been aware of the differences among the historical UNIX system variants and have avoided the nonportable aspects to increase the scope of their applications' marketplace. However, the previous goal of a consistent interface was considered to be quite important, so there will be modifications required to some applications if they wish to be maximally portable in the future.

1.1.1.1.4 Preserve Historical Implementations

As explained in 1.1.1.1.2, the requirements for portability and a consistent interface have caused the working group to add new utilities and features. No historical implementations contained all of the attributes required by the working group. Therefore, this lowest priority goal fell victim to the preceding goals, and every known historical implementation will require some modifications to conform to this standard. The working group took care to ensure that the implementations could add the new or modified features without breaking the operation of existing applications. (Note that the standard utilities are not considered applications in this regard, but are part of the implementation. In fact, many or most of the utilities named by this standard will have to change to some extent.)

1.1.1.2 Outside the Scope.
(This subclause is not a part of P1003.2)

The following areas are outside the scope of this standard.
This subclause explains more of the rationale behind the exclusions. (It should be noted that this is not an official list. It was not part of the Project Authorization Request submitted to the IEEE, but was devised as a guide to keep the working group discussions on track.)

(1) Operating system administrative commands (privileged processes, system processes, daemons, etc.). The working group followed the lead of the POSIX.1 {8} group in this instance. Administrative commands were felt to be too implementation dependent and not useful for application portability. Subsequent to this decision, a separate POSIX.7 working group was formed to deal with this area of ``operator portability.'' It is anticipated that utilities needed for system administration will be closely coordinated with the POSIX.2 working group.

(2) Commands required for the installation, configuration, or maintenance of operating systems or file systems. This area is similar to item (1). System installation is contrasted against the application installation portion of the Scope by its orientation to installing the operating system itself, versus application programs. The exclusion of operating system installation facilities should not be interpreted to mean that the application installation procedures cannot be used for installing operating system components. The proposed interface for this area encountered stiff resistance from the balloting group in Draft 8 and was temporarily withdrawn. As described in Annex E.4, a decision of the balloting group is pending on whether to begin work on a supplement to this standard (POSIX.2b) for application installation.

(3) Networking commands. These were excluded because they are deeply involved with other standards making bodies and are probably too complicated.
In this case, several working groups were formed within the POSIX family to deal with this. It is anticipated that utilities needed for networking, if any, will be closely coordinated with the POSIX.2 working group. (In early drafts of this standard, which predated the formation of the networking-specific POSIX working groups, the historical ``UNIX system to UNIX system copy [UUCP]'' programs and protocols were included. These descriptions have been removed in deference to a more appropriate working group.)

(4) Terminal control or user-interface programs (e.g., visual shells, visual editors, window managers, command history mechanisms, etc.). This is probably the most contentious exclusion. A common complaint about many UNIX systems is how they're not very ``user friendly.'' Some people have hoped that the interface to users could be standardized with mice, icon-based desktop metaphors, and so forth. This standard neatly sidesteps those concerns by reminding its audience that it is an application portability standard, and therefore has little relationship to the manner in which users manage their terminals.

However, this guideline was not meant to apply to applications. It is perfectly reasonable for an application to assume it can have a user interacting with it. That is why such facilities as displaying strings (with printf) without <newline>s, stty, and various prompting utilities are included in the standard.

The interfaces in this standard are very oriented to command lines being issued by shell scripts, or through the system() or popen() functions. Therefore, interactive text editors, pagers, and other user interface tools have been omitted for now.
Alternatively, other standards bodies, such as X3H3.6 and the IEEE TCOS P1201 working group, are devising interfaces that could possibly be more useful and long-lived than any prescribed by POSIX.2.

There is one area of this subject that will be addressed by POSIX.2. The scope of the working group has been expanded to include what is being termed the User Portability Extension, POSIX.2a. This will be published as a supplement to this standard and have the goal of providing a portable environment for relatively expert time-sharing or software development users. It will not attempt to deal with mice or windows or other advanced interfaces at this time, but should cover many of the terminal-oriented utilities, such as a full-screen editor, currently avoided by this edition of POSIX.2.

(5) Graphics programs or interfaces. See the comments on user interface, above.

(6) Text formatting programs or languages. The existing text formatting languages are generally too primitive in scope to satisfy many users, who have relied on a myriad of macro languages. There is an ISO standard text description language, SGML, but this has had insufficient exposure to the UNIX system community for standardization as part of POSIX at this time.

(7) Database programs or interfaces (e.g., SQL, etc.). These interfaces are the province of other standards bodies.

1.1.1.3 Language-Independent Descriptions.
(This subclause is not a part of P1003.2)

The POSIX.1 {8} and POSIX.5 working groups are currently engaged in developing the model for language-independent descriptions of system services.
When complete, it will allow the C language bias of the POSIX.1 {8} standard to be excised and C will take its place among other language bindings that interface with the core services descriptions.

The POSIX.2 working group did not wish to duplicate effort, and has therefore waited until POSIX.1 {8} achieves progress in this area. Thus, like the first version of POSIX.1 {8}, the initial drafts of POSIX.2 start life as a C-only standard, with language independence scheduled to be included in a later draft. Fortunately, this standard is substantially less involved with C than POSIX.1 {8} is. In fact, all of the C interfaces are entirely optional.

1.1.1.4 Base Documents.
(This subclause is not a part of P1003.2)

The working group consulted a number of documents in the course of its deliberations, to select utilities and features. There were five primary documents that started off the process:

(1) The System V Interface Definition (SVID), Issue 2, Volume 2.

(2) The X/Open Portability Guide (XPG), Issues II and III, Volume 1.

(3) The UNIX User's Reference Manual, 4.3 Berkeley Software Distribution, Virtual VAX-11 Version. (The printed documentation as well as the online versions provided with the BSD ``Tahoe'' and ``Reno'' distributions were considered as one base document for the POSIX.2 work.)

(4) The KornShell Command and Programming Language, by Bolsky and Korn.

(5) The AWK Programming Language, by Aho, Kernighan, and Weinberger.

The XPG was used most heavily in initial deliberations about which utilities and features to include. The X/Open companies had done a very thorough job in analyzing the SVID and other standards to compile a list of the most useful and portable utilities. They carefully marked many features that had portability problems and the working group avoided them for this standard.

AT&T, X/Open, and Berkeley provided machine-readable documentation for the use of the working group. However, due to very substantial differences in formatting standards, there is little resemblance between some of the utilities described here and their cousins in the SVID, XPG, and BSD user manual. Nevertheless, early usage of these documents was an invaluable aid in the production of the standard and the POSIX.2 working group extends its sincere thanks to all three organizations for their generous cooperation.

The biggest divergence in POSIX.2's documentation has been its philosophy of fully specifying interfaces. The SVID and XPG are oriented solely towards application portability. Implementors would have a difficult time writing some of these utilities from the descriptions alone. In fact, both documents freely rely on the potential implementors licensing the source code for the reference systems to complete the specification. The POSIX.2 standard, on the other hand, also has implementors in its audience and it strove to expand its descriptions wherever useful and feasible. For example, it makes use of BNF grammars to describe complex syntaxes. It attempts to describe the interactions between options, operands, and environment variables, where conflicts can exist. It also attempts to describe all of the useful utility input and output formats. The goal here was to allow application developers to write filters or other programs that could parse the output of any of these utilities or to provide meaningful input from their programs.
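A specified output format is what makes a filter such as the following possible. This is only a sketch: it assumes that cksum, when reading standard input, writes the checksum and the octet count separated by a single space (with a pathname field added only when a file operand is given), which is how later published editions describe it; the function name octets_of_stdin is hypothetical.

```sh
#!/bin/sh
# Illustrative sketch only. Assumes cksum writes "<checksum> <octets>" for
# standard input, followed by " <pathname>" when a file operand is given.
octets_of_stdin() {
    cksum | {
        # The first two fields are named; any further text lands in "rest".
        read -r checksum octets rest
        printf '%s\n' "$octets"
    }
}

octet_count=$(printf 'hello\n' | octets_of_stdin)
printf 'octets=%s\n' "$octet_count"
```

Because the field order is fixed by the utility description rather than by a particular implementation's source code, the same filter can be expected to work wherever the utility conforms.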
To the working group's knowledge, this is a task never before attempted for the historical UNIX system commands--the source code was always so readily available to anyone who really needed to know this information.

The two commercial books listed were used as reference materials in preparing information on the shell and the awk language that was more recent and complete than AT&T's or X/Open's documentation.

1.1.1.5 History.
(This subclause is not a part of P1003.2)

The 1984 /usr/group Standard was originally intended to include the shell and user level commands. However, the /usr/group (now known as ``UniForum'') Standards Committee was unable to begin this effort, due to the complexity of the system call and library functions that it eventually did publish. A shell was referred to in the system() function defined by ANSI/X3.159-1989 Programming Language C Standard, but no syntax for the shell command language was attempted.

As the first version of POSIX.1 {8} neared completion, it became apparent that the usefulness of POSIX would be diminished if no shell or utilities were defined. Therefore, the POSIX.2 working group was formed in January 1986 at the Denver, Colorado, meeting of POSIX.1 {8} to address this concern.

The progress of the working group has seemed rather slow during the more than three years of its existence. This is primarily because its membership had substantial overlap with the POSIX.1 {8} working group; for example, the Chair of POSIX.2 was also the Technical Editor of POSIX.1 {8} (and POSIX.2 as well!) at the time. And, meetings were arbitrarily shortened to allow the POSIX.1 {8} group to move forward as quickly as possible.

1.1.1.6 Internationalization.
(This subclause is not a part of P1003.2)

Some of the utilities and concepts described in this standard contain requirements that standardize multilingual and multicultural support. Most of the internationalized support for this standard was proposed by the UniForum Technical Committee Subcommittee on Internationalization, at the request of the POSIX.2 working group.

UniForum, a nonprofit organization, organizes subcommittees of Technical Committees to do standards research on different topics pertinent to POSIX. The UniForum Subcommittee on Internationalization is one such group. It was formed to propose and promote standard internationalized extensions to POSIX-based systems.

The POSIX.2 working group and the UniForum Subcommittee on Internationalization coordinated their work by the use of liaison members, who attended the meetings of both groups. The interaction between the two groups started when POSIX.2 asked the Subcommittee on Internationalization to provide internationalized support for regular expressions. Later, the Subcommittee on Internationalization was charged with identifying areas in the standard needing changes for internationalized support and proposing those changes.

1.1.1.7 Test Methods.
(This subclause is not a part of P1003.2)

The POSIX.3 working group has worked on a test methods specification for verifying conformance to POSIX standards in general and POSIX.1 {8} and POSIX.2 in particular. Test methods for POSIX.2 should be published as a separate document (see footnote 1) sometime after POSIX.2 is approved.

__________
1) See the Foreword for information on the activities of other POSIX working groups.

1.1.1.8 Organization of the Standard.
(This subclause is not a part of P1003.2)

The standard document is organized into sections. Some of these, such as the Scope in 1.1, are mandated by ISO/IEC, the IEEE, and other standards bodies. The remainder of the document is organized into small sections for the convenience of the working group and others.

It has been suggested that all of the utility descriptions (and maybe the functions, too) should be lumped into one large section, all in alphabetical order. This would presumably make it easier for some users to use the document as a reference document. The working group deliberately chose to not organize it in this way, for the following reasons:

(1) Certain sections are optional. It is more convenient for the document's internal references, and also for people specifying systems, if these optional sections are in large pieces, rather than a detailed list of utility names.

(2) Future supplements to this standard will be adding new utilities that will also be optional. It would be confusing to try to merge documents at a level below major sections (chapters).

END_RATIONALE

1.2 Normative References

The following standards contain provisions which, through references in this text, constitute provisions of this standard. At the time of publication, the editions indicated were valid. All standards are subject to revision, and parties to agreements based on this part of this International Standard are encouraged to investigate the possibility of applying the most recent editions of the standards listed below. Members of IEC and ISO maintain registers of currently valid International Standards.

{1} ISO/IEC 646: 1983 (see footnote 2), Information processing--ISO 7-bit coded character set for information interchange.

__________
2) Under revision. (This notation is meant to explicitly reference the 1990 Draft International Standard version of ISO/IEC 646.)
ISO/IEC documents can be obtained from the ISO office, 1, rue de Varembé, Case Postale 56, CH-1211, Genève 20, Switzerland/Suisse.

{2} ISO 1539: 1980, Programming languages--FORTRAN.

{3} ISO 4217: 1987, Codes for the representation of currencies and funds.

{4} ISO 4873: 1986, Information processing--ISO 8-bit code for information interchange--Structure and rule for implementation.

{5} ISO 8859-1: 1987, Information processing--8-bit single-byte coded graphic character sets--Part 1: Latin alphabet No. 1.

{6} ISO 8859-2: 1987, Information processing--8-bit single-byte coded graphic character sets--Part 2: Latin alphabet No. 2.

{7} ISO/IEC 9899: 1990, Information processing systems--Programming languages--C.

{8} ISO/IEC 9945-1: 1990, Information technology--Portable Operating System Interface (POSIX)--Part 1: System Application Program Interface (API) [C Language]

1.3 Conformance

1.3.1 Implementation Conformance

1.3.1.1 Requirements

A conforming implementation shall meet all of the following criteria:

(1) The system shall support all required interfaces defined within this standard. These interfaces shall support the functional behavior described herein. The system shall provide the shell command language described in Section 3 and the utilities in Section 4.

(2) The system may provide one or more of the following: the Software Development Utilities Option, the C Language Bindings Option, the C Language Development Utilities Option, the FORTRAN Development Utilities Option, or the FORTRAN Runtime Utilities Option. When an implementation claims that an optional facility is provided, all of its constituent parts shall be provided.

(3) The system may provide additional or enhanced utilities, functions, or facilities not required by this standard. Nonstandard extensions should be identified as such in the system documentation. Nonstandard extensions, when used, may change the behavior of utilities, functions, or facilities defined by this standard. In such cases, the implementation's conformance document (see 2.2.1.2) shall define an execution environment (i.e., shall provide general operating instructions) in which an application can be run with the behavior specified by the standard. In no case shall such an environment require modification of a Strictly Conforming POSIX.2 Application.

1.3.1.2 Documentation

A conformance document with the following information shall be available for an implementation claiming conformance to this standard. The conformance document shall have the same structure as this standard, with the information presented in the appropriately numbered sections; sections that consist solely of subordinate section titles, with no other information, are not required.

The conformance document shall not contain information about extended facilities or capabilities outside the scope of this standard, unless those extensions affect the behavior of a Strictly Conforming POSIX.2 Application; in such cases, the documentation required by the previous subclause shall be included.

The conformance document shall contain a statement that indicates the full name, number, and date of the standard that applies. The conformance document may also list software standards approved by ISO/IEC or any ISO/IEC member body that are available for use by a Conforming POSIX.2 Application. It should indicate whether it is based on a fully-conformant POSIX.1 {8} system. Applicable characteristics where documentation is required by one of these standards, or by standards of government bodies, may also be included.

The conformance document shall describe the symbolic values found in 2.13.2, stating values, the conditions under which those values can change, and the limits of such variations, if any.

The conformance document shall describe the behavior of the implementation for all implementation-defined features defined in this standard. This requirement shall be met by listing these features and providing either a specific reference to the system documentation or providing full syntax and semantics of these features. When the value or behavior in the implementation is designed to be variable or customizable on each instantiation of the system, the implementation provider shall document the nature and permissible ranges of this variation. When information required by this standard is related to the underlying operating system and is already available in the POSIX.1 {8} conformance document, the implementation need not duplicate this information in the POSIX.2 conformance document, but may provide a cross-reference for this purpose.

The conformance document may specify the behavior of the implementation for those features where this standard states that implementations may vary or where features are identified as undefined or unspecified.
No specifications other than those described in this subclause (1.3.1.2) shall be present in the conformance document.

The phrase ``shall be documented'' in this standard means that documentation of the feature shall appear in the conformance document, as described previously, unless the system documentation is explicitly mentioned.

The system documentation should also contain the information found in the conformance document.

1.3.1.3 Conforming Implementation Options

The following symbolic constants, described in 2.13.2, reflect implementation options for this standard that could warrant requirement by Conforming POSIX.2 Applications, or in specifications of conforming systems, or both:

{POSIX2_SW_DEV}     The system supports the Software Development Utilities Option in Section 6.

{POSIX2_C_BIND}     The system supports the C Language Bindings Option in Annex B.

{POSIX2_C_DEV}      The system supports the C Language Development Utilities Option in Annex A.

{POSIX2_FORT_DEV}   The system supports the FORTRAN Development Utilities Option in Annex C.

{POSIX2_FORT_RUN}   The system supports the FORTRAN Runtime Utilities Option in Annex C.

{POSIX2_LOCALEDEF}  The system supports the creation of locales as described in 4.35.

Additional language bindings and development utility options may be provided in other related standards or in future revisions to this standard. In the former case, additional symbolic constants of the same general form as shown in this subclause should be defined by the related standard document and made available to the application, without requiring this POSIX.2 document to be updated.

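An application that wishes to adapt to these options needs some way to query them at run time. The following is a minimal sketch, not text from the draft. It assumes that the getconf utility accepts the constant names listed above, that a supported option is reported as a nonzero number, and that an unsupported or unknown option is reported as ``undefined'', as an empty value, as -1 or 0, or by a failing exit status. The function name posix2_option is hypothetical.

```sh
#!/bin/sh
# Illustrative sketch only. Assumption: getconf reports an option constant as
# a nonzero number when supported and as "undefined" (or fails) otherwise.
posix2_option() {
    value=$(getconf "$1" 2>/dev/null) || value=undefined
    case $value in
        ''|undefined|-1|0) echo no ;;
        *) echo yes ;;
    esac
}

# Report each option named in this subclause.
for option in POSIX2_SW_DEV POSIX2_C_BIND POSIX2_C_DEV \
              POSIX2_FORT_DEV POSIX2_FORT_RUN POSIX2_LOCALEDEF
do
    printf '%s: %s\n' "$option" "$(posix2_option "$option")"
done
```

A script could use the result to choose between an optional utility and a fallback, for example by invoking c89 only when posix2_option POSIX2_C_DEV prints yes.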
1.3.2 Application Conformance

All applications claiming conformance to this standard fall within one of the following categories:

1.3.2.1 Strictly Conforming POSIX.2 Application

A Strictly Conforming POSIX.2 Application is an application that requires only the facilities described in this standard (including any required facilities of the underlying operating system; see 2.9.1). Such an application:

(1) shall accept any implementation behavior that results from actions it takes in areas described in this standard as implementation-defined or unspecified, or where the standard indicates that implementations may vary;

(2) shall not perform any actions that are described as producing undefined results;

(3) for symbolic constants, shall accept any value in the range permitted by this standard, but shall not rely on any value in the range being greater than the minimums listed in this standard;

(4) shall not use facilities designated as obsolescent;

(5) is required to tolerate, and is permitted to adapt to, the presence or absence of optional facilities whose availability is indicated by the constants in 2.13.1, or that are described using the verb may. However, an application requiring a high-level language binding option can only be considered at best a Conforming POSIX.2 Application; see 1.3.2.2.

Within this standard, any restrictions placed upon a Conforming POSIX.2 Application shall also restrict a Strictly Conforming POSIX.2 Application.

1.3.2.2 Conforming POSIX.2 Application

The term Conforming POSIX.2 Application is used to describe either of the two following application types.

1.3.2.2.1 ISO/IEC Conforming POSIX.2 Application

An ISO/IEC Conforming POSIX.2 Application is an application that uses only the facilities described in this standard (including the implied facilities of the underlying operating system; see 2.9.1) and approved conforming language bindings for any ISO/IEC standard. Such an application shall include a statement of conformance that documents all options and limit dependencies, and all other ISO/IEC standards used.

1.3.2.2.2 <National Body> Conforming POSIX.2 Application

A <National Body> Conforming POSIX.2 Application differs from an ISO/IEC Conforming POSIX.2 Application in that it also may use specific standards of a single ISO/IEC member body referred to here as ``<National Body>.'' Such an application shall include a statement of conformance that documents all options and limit dependencies, and all other <National Body> standards used.

1.3.2.3 Conforming POSIX.2 Application Using Extensions

A Conforming POSIX.2 Application Using Extensions is an application that differs from a Conforming POSIX.2 Application only in that it uses nonstandard facilities that are consistent with this standard. Such an application shall fully document its requirements for these extended facilities, in addition to the documentation required of a Conforming POSIX.2 Application. A Conforming POSIX.2 Application Using Extensions shall be either an ISO/IEC Conforming POSIX.2 Application Using Extensions or a <National Body> Conforming POSIX.2 Application Using Extensions (see 1.3.2.2.1 and 1.3.2.2.2).

BEGIN_RATIONALE

1.3.3 Conformance Rationale. (This subclause is not a part of P1003.2)

These conformance definitions are closely related to those in POSIX.1 {8}. The terms Conforming POSIX.2 Application and its variants were selected to parallel the terms used in POSIX.1 {8}.
The descriptions of the ISO/IEC and <National Body> Conforming POSIX.2 Applications are similar to the same descriptions in POSIX.1 {8}. This is not a duplication of effort, as this standard relies on only a portion of POSIX.1 {8}, as explained in 1.1 and 2.9.1. Therefore conformance to POSIX.2 has to be described separately from any conformance options or requirements in POSIX.1 {8}.

A reference to a Language-Independent System Services Option was removed from the list of optional features that may be provided by the conforming implementation. There is no conformance value provided by that section, except as a reference point for functions actually provided by a real language binding. Therefore, the language binding sections are the ones that remain in the optional list.

The Draft 8 section Language-Dependent Services for the C Programming Language was removed, as this subject is adequately, and appropriately, covered in Annex A.

The documentation requirement for implementation extensions (``shall define an execution environment'') is simply meant to require that system-wide or per-user configuration options or environment variables that affect the operation of applications that use the standard utilities and functions be described in the conformance document. For example, if setting the (imaginary) LC_TRUTH variable causes changes in the exit status of true, the conformance document must describe this condition and how to avoid it--say, by unsetting the variable in the login script.

For further rationale on the types of conformance, see the POSIX.1 {8} Rationale.

END_RATIONALE
Section 2: Terminology and General Requirements

2.1 Conventions

2.1.1 Editorial Conventions

This standard uses the following editorial and typographical conventions. A summary of typographical conventions is shown in Table 2-1.

The Bold Courier font is used to show brackets that denote optional arguments in a utility synopsis, as in

    cut [-c list] [file_name]

These brackets shall not be used by the application unless they are specifically mentioned as literal input characters by the utility description.

There are two types of symbols enclosed in angle brackets (< >):

C-Language Headers
    The header name, enclosed in the angle brackets, is in the Courier font. When coding C programs, the brackets are used as required by the language.

Parameters
    Parameters, also called metavariables, are in italics, such as <directory pathname>. The entire symbol, including the brackets, is meant to be replaced by the value of the symbol described within the brackets.

Numbers within braces, such as ``POSIX.1 {8},'' represent cross references to the Normative References clause (see 1.2). If the number is preceded by a B, it represents a Bibliographic entry (see Annex D). Bibliographic entries are for information only.

In some examples, the Bold Courier font is used to indicate the system's output that resulted from some user input, shown in Courier.
Table 2-1 - Typographical Conventions

Reference                               Example
--------------------------------------  ------------------------------
C-Language Data Type                    long
C-Language Function                     system()
C-Language Function Argument            arg1
C-Language Global External              errno
C-Language Header
C-Language Keyword                      #define
Cross Reference: Annex                  Annex A
Cross Reference: Clause                 2.3
Cross Reference: Other Standard         ISO 9999-1 {n}
Cross Reference: Section                Section 2
Cross Reference: Subclause              2.3.4, 2.3.4.5, 2.3.4.5.6
Defined Term                            (see text)
Environment Variable                    PATH
Error Number                            [EINTR]
Example Input                           echo foo
Example Output                          foo
Figure Reference                        Figure 7
File Name                               /tmp
Parameter                               <directory pathname>
Special Character                       <newline>
Symbolic Constant, Limit                {_POSIX_VDISABLE}, {LINE_MAX}
Table Reference                         Table 6
Utility Name                            awk
Utility Operand                         file_name
Utility Option                          -c
Utility Option with Option-Argument     -w width

Defined terms are shown in three styles, depending on context:

(1) Terms defined in 2.2.1, 2.2.2, and 3.1 are expressed as subclause titles. Alternative forms of the terms appear in [brackets].

(2) The initial appearances of other terms, applying to a limited portion of the text, are in italics.

(3) Subsequent appearances of the term are in the Roman font.

Symbolic constants are shown in two styles: those within curly braces are intended to call the reader's attention to values in <limits.h> and <unistd.h>; those without braces are usually defined by one or a few related functions. There is no semantic difference between these two forms of presentation.

Filenames and pathnames are shown in Courier. When a pathname is shown starting with ``$HOME/'', this indicates the remaining components of the pathname are to be related to the directory named by the user's HOME environment variable.

The style selected for some of the special characters, such as <newline>, matches the form of the input given to the localedef utility (see 2.5.2). Generally, the characters selected for this special treatment are those that are not visually distinct, such as the control characters <tab> or <newline>.

Literal characters and strings used as input or output are shown in various ways, depending on context:

%, begin
    When no confusion would result, the character or string is rendered in the Courier font and used directly in the text.

'c'
    In some cases a character is enclosed in single-quote characters, similar to a C-language character constant. Unless otherwise noted, the quotes shall not be used as input or output.

"string"
    In some cases, a string is enclosed in double-quote characters, similar to a C-language string constant. Unless otherwise noted, the quotes shall not be used as input or output.

Defined names that are usually in lowercase, particularly function names, are never used at the beginning of a sentence or anywhere else that regular English usage would require them to be capitalized.

Parenthetical expressions within normative text also contain normative information. The general typographic hierarchy of parenthetical expressions is:

    { [ ( ) ] }

The square brackets are most frequently used to enclose a parenthetical expression that contains a function name [such as waitpid()], with its built-in parentheses.

In some cases, tabular information is presented inline; in others it is presented in a separately-labeled Table.
This arrangement was employed purely for ease of reference and there is no normative difference between these two cases.

Annexes marked as normative are parts of the standard that pose requirements, exactly the same as the numbered Sections, but have been moved to near the end of the document for clarity of exposition. Informative Annexes are for information only and pose no requirements. All material preceding page 1 of the document (the ``front matter'') and the two indexes at the end are also only informative.

NOTES that appear in a smaller point size and are indented have one of two different meanings, depending on their location:

- When they are within the normal text of the document, they are the same as footnotes--informative, posing no requirements on implementations or applications.

- When they are attached to Tables or Figures, they are normative, posing requirements.

Text marked as examples (including the use of ``e.g.'') is for information only. The exception to this comes in the C-language programs and program fragments used to represent algorithms, as described in 2.1.3.

The typographical conventions listed here are for ease of reading only. Editorial inconsistencies in the use of typography are unintentional and have no normative meaning in this standard.

2.1.2 Grammar Conventions

Portions of this standard are expressed in terms of a special grammar notation. It is used to portray the complex syntax of certain program input. The grammar is based on the syntax used by the yacc utility (see A.3). However, it does not represent fully functional yacc input, suitable for program use: the lexical processing and all semantic requirements are described only in textual form.
The grammar is not based on source used in any traditional implementation and has not been tested with the semantic code that would normally be required to accompany it. Furthermore, there is no implication that the partial yacc code presented represents the most efficient, or only, means of supporting the complex syntax within the utility. Implementations may use other programming languages or algorithms, as long as the syntax supported is the same as that represented by the grammar.

The following typographical conventions are used in the grammar; they have no significance except to aid in reading.

- The identifiers for the reserved words of the language are shown with a leading capital letter. (These are terminals in the grammar. Examples: While, Case.)

- The identifiers for terminals in the grammar are all named with uppercase letters and underscores. Examples: NEWLINE, ASSIGN_OP, NAME.

- The identifiers for nonterminals are all lowercase.

2.1.3 Miscellaneous Conventions

This standard frequently uses the C language to express algorithms in terms of programs or program fragments. The following shall be considered in reading this code:

- The programs use the syntax and semantics described by the C Standard {7}.

- The programs are merely examples and do not represent the most efficient, or only, means of coding the interface. Implementations may use other programming languages or algorithms, as long as the results are the same as those achieved by the programs in this standard.

- C-language comments are informative and pose no requirements.
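As an informal aid to reading the grammars, the identifier naming conventions of 2.1.2 (reserved words with a leading capital letter, other terminals in uppercase letters and underscores, nonterminals in lowercase) can be sketched as a small classifier. This is only an illustration: the function name classify_grammar_id and the identifiers cmd_list and aB are hypothetical, and the sketch assumes that nonterminals may also contain underscores.

```shell
# Hypothetical classifier for grammar identifiers, following the
# typographical conventions of 2.1.2. Assumption: nonterminals may
# contain underscores as well as lowercase letters.
classify_grammar_id() {
    id=$1
    case $id in
        '' | *[![:alpha:]_]*) echo unknown; return 0 ;;  # empty, or not letters/underscores
    esac
    case $id in
        *[[:lower:]]*) ;;                                # has lowercase: keep checking
        *[[:upper:]]*) echo terminal; return 0 ;;        # uppercase and underscores only
        *) echo unknown; return 0 ;;                     # underscores only
    esac
    case $id in
        *[[:upper:]]*) ;;                                # mixed case: keep checking
        *) echo nonterminal; return 0 ;;                 # lowercase and underscores only
    esac
    case ${id#?} in
        *[[:upper:]]*) echo unknown ;;                   # capital after the first character
        *)
            case $id in
                [[:upper:]]*) echo reserved-word ;;      # leading capital, rest lowercase
                *) echo unknown ;;
            esac
            ;;
    esac
}

for id in While Case NEWLINE ASSIGN_OP NAME cmd_list aB
do
    printf '%s: %s\n' "$id" "$(classify_grammar_id "$id")"
done
```

A single capital letter is classified as a terminal here, since the conventions alone cannot distinguish it from a one-letter reserved word.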
Further conventions are presented in:

- Utility Conventions, 2.10, describing utility and application command-line syntax

- File Format Notation, 2.12, describing the notation used to represent utility input and output

2.1.4 Conventions Rationale. (This subclause is not a part of P1003.2)

The C language was chosen for many examples because:

- It eliminates any requirement to document a different pseudocode.

- It is a familiar language to many of the potential readers of POSIX.2.

- It is the language most widely used for historical implementations of the utilities.

2.2 Definitions

2.2.1 Terminology

For the purposes of this standard, the following definitions apply:

2.2.1.1 can: The word can is to be interpreted as describing a permissible optional feature or behavior available to the application; the implementation shall support such features or behaviors as mandatory requirements.

2.2.1.2 conformance document: A document provided by an implementor that contains implementation details as described in 1.3.1.2.

2.2.1.3 implementation: An object providing to applications and users the services defined by this standard. The word implementation is to be interpreted to mean that object, after it has been modified in accordance with the manufacturer's instructions to:

- configure it for conformance with this standard;

- select some of the various optional facilities described by this standard, through customization by local system administrators or operators.

An exception to this meaning occurs when discussing conformance documentation or using the term implementation defined. See 2.2.1.4 and 1.3.1.2.
2.2.1.4 implementation defined: When a value or behavior is described by this standard as implementation defined, the implementation provider shall document the requirements for correct program construction and correct data in the use of that value or behavior. When the value or behavior in the implementation is designed to be variable or customizable on each instantiation of the system, the implementation provider shall document the nature and permissible ranges of this variation. (See 1.3.1.2.)

2.2.1.5 may: The word may is to be interpreted as describing an optional feature or behavior of the implementation that is not required by this standard, but there is no prohibition against providing it. A Strictly Conforming POSIX.2 Application is permitted to use such features, but shall not rely on the implementation's actions in such cases. To avoid ambiguity, the reverse sense of may is not expressed as may not, but as need not.

2.2.1.6 obsolescent: Certain features are obsolescent, which means that they may be considered for withdrawal in future revisions of this standard. They are retained in this version because of their widespread use. Their use in new applications is discouraged.

2.2.1.7 shall: In this standard, the word shall is to be interpreted as a requirement on the implementation or on Strictly Conforming POSIX.2 Applications, where appropriate.

2.2.1.8 should: With respect to implementations, the word should is to be interpreted as an implementation recommendation, but not a requirement. With respect to applications, the word should is to be interpreted as recommended programming practice for applications and a requirement for Strictly Conforming POSIX.2 Applications.
2.2.1.9 system documentation: All documentation provided with an implementation, except the conformance document. Electronically distributed documents for an implementation are considered part of the system documentation.

2.2.1.10 undefined: A value or behavior is undefined if the standard imposes no portability requirements on applications for erroneous program construction, erroneous data, or use of an indeterminate value. Implementations (or other standards) may specify the result of using that value or causing that behavior. An application using such behaviors is using extensions, as defined in 1.3.2.3.

2.2.1.11 unspecified: A value or behavior is unspecified if the standard imposes no portability requirements on applications for a correct program construction or correct data. Implementations (or other standards) may specify the result of using that value or causing that behavior. An application requiring a specific behavior, rather than tolerating any behavior when using that functionality, is using extensions, as defined in 1.3.2.3.

BEGIN_RATIONALE

2.2.1.12 Terminology Rationale (This subclause is not a part of P1003.2)

Most of these terms were adapted from their POSIX.1 {8} counterparts with little modification.

The reader is referred to the definition of program in 2.2.2.119 to understand the expression ``program construction.'' The use of program in this standard is differentiated from POSIX.1 {8}'s emphasis only on high level languages by this standard's broader concern with utility and command language interactions.
Included in the scope of program construction are:

(1) Shell command language

(2) Command arguments

(3) Regular expressions, of various types

(4) Command input language syntax, such as awk, bc, ed, lex, make, sed, and yacc. Some of these are so complex that they rival traditional high level languages.

The usage of can and may was selected to contrast optional application behavior (can) against optional implementation behavior (may).

The term supported was removed from Draft 8; it had originally been copied from the POSIX.1 {8} document, but it later became clear that its requirement for function ``stubs'' for unsupported functions made little sense in this standard. The term support therefore reverts to its English-language meaning.

The term obsolescent was changed to deprecated in some earlier drafts, but it was restored to match POSIX.1 {8}'s use of the term. It means ``do not use this feature in new applications.'' The obsolescence concept is not an ideal solution, but was used as a method of increasing consensus: many more objections would be heard from the user community if some of these historical features were suddenly withdrawn without the grace period obsolescence implies. The phrase ``may be considered for withdrawal in future revisions'' implies that the result of that consideration might in fact keep those features indefinitely if the predominance of applications does not migrate away from them quickly.

END_RATIONALE

2.2.2 General Terms

For the purposes of this standard, the following definitions apply.

2.2.2.1 absolute pathname: See pathname resolution in 2.2.2.104.

2.2.2.2 address space: The memory locations that can be referenced by a process.
[POSIX.1 {8}]

2.2.2.3 affirmative response: An input string that matches one of the responses acceptable to the LC_MESSAGES category keyword yesexpr, matching an extended regular expression in the current locale; see 2.5.

2.2.2.4 <alert>: A character that in the output stream shall indicate that a terminal should alert its user via a visual or audible notification. The <alert> shall be the character designated by '\a' in the C language binding. It is unspecified whether this character is the exact sequence transmitted to an output device by the system to accomplish the alert function.

2.2.2.5 angle brackets: The characters ``<'' (left-angle-bracket) and ``>'' (right-angle-bracket). When used in the phrase ``enclosed in angle brackets'' the symbol ``<'' shall immediately precede the object to be enclosed, and ``>'' shall immediately follow it. When describing these characters in 2.4, the names <less-than-sign> and <greater-than-sign> are used.

2.2.2.6 appropriate privileges: An implementation-defined means of associating privileges with a process with regard to the function calls and function call options defined in POSIX.1 {8} that need special privileges. There may be zero or more such means. [POSIX.1 {8}]

2.2.2.7 argument: A parameter passed to a utility as the equivalent of a single string in the argv array created by one of the POSIX.1 {8} exec functions. See 2.10.1 and 3.9.1.1. An argument is one of the options, option-arguments, or operands following the command name.

2.2.2.8 asterisk: The character ``*''.

2.2.2.9 background process: A process that is a member of a background process group. [POSIX.1 {8}]

2.2.2.10 background process group: Any process group, other than a foreground process group, that is a member of a session that has established a connection with a controlling terminal.
[POSIX.1 {8}]

2.2.2.11 backquote: The character ```'', also known as a grave accent.

2.2.2.12 backslash: The character ``\'', also known as a reverse solidus.

2.2.2.13 <backspace>: A character that normally causes printing (or displaying) to occur one column position previous to the position about to be printed. The <backspace> shall be the character designated by '\b' in the C language binding. It is unspecified whether this character is the exact sequence transmitted to an output device by the system to accomplish the backspace function. The <backspace> character defined here is not necessarily the ERASE special character defined in POSIX.1 {8} 7.1.1.9.

2.2.2.14 basename: The final, or only, filename in a pathname.

2.2.2.15 basic regular expression: A pattern (sequence of characters or symbols) constructed according to the rules defined in 2.8.3.

2.2.2.16 <blank>: One of the characters that belong to the blank character class as defined via the LC_CTYPE category in the current locale. In the POSIX Locale, a <blank> is either a <tab> or a <space>.

2.2.2.17 blank line: A line consisting solely of zero or more <blank>s terminated by a <newline>. See also empty line (2.2.2.44).

2.2.2.18 block special file: A file that refers to a device. A block special file is normally distinguished from a character special file by providing access to the device in a manner such that the hardware characteristics of the device are not visible. [POSIX.1 {8}]

2.2.2.19 braces: The characters ``{'' (left brace) and ``}'' (right brace), also known as curly braces. When used in the phrase ``enclosed in (curly) braces'' the symbol ``{'' shall immediately precede the object to be enclosed, and ``}'' shall immediately follow it. When describing these characters in 2.4, the names <left-brace> and <right-brace> are used.
2.2.2.20 brackets: The characters ``['' (left-bracket) and ``]'' (right-bracket), also known as square brackets. When used in the phrase ``enclosed in (square) brackets'' the symbol ``['' shall immediately precede the object to be enclosed, and ``]'' shall immediately follow it. When describing these characters in 2.4, the names <left-square-bracket> and <right-square-bracket> are used.

2.2.2.21 built-in utility: A utility implemented within a shell. The utilities referred to as special built-ins have special qualities, described in 3.14. Unless qualified, the term built-in includes the special built-in utilities. The utilities referred to as regular built-ins are those named in Table 2-2. As indicated in 2.3, there is no requirement that these utilities be actually built into the shell on the implementation, but that they do have special command-search qualities.

2.2.2.22 byte: An individually addressable unit of data storage that is equal to or larger than an octet, used to store a character or a portion of a character; see 2.2.2.24.

A byte is composed of a contiguous sequence of bits, the number of which is implementation defined. The least significant bit is called the low-order bit; the most significant is called the high-order bit. [POSIX.1 {8}]

NOTE: This definition of byte is actually from the C Standard {7} because POSIX.1 {8} merely references it without copying the text. It has been reworded slightly to clarify its intent without introducing the C Standard {7} terminology ``basic execution character set,'' which is inapplicable to this standard. It deviates intentionally from the usage of byte in some other standards, where it is used as a synonym for octet (always eight bits).
On a POSIX.1 {8} system, a byte may be larger than eight bits so that it can be an integral portion of larger data objects that are not evenly divisible by eight bits (such as a 36-bit word that contains 4 9-bit bytes).

2.2.2.23 <carriage-return>: A character that in the output stream shall indicate that printing should start at the beginning of the same physical line in which the <carriage-return> occurred. The <carriage-return> shall be the character designated by '\r' in the C language binding. It is unspecified whether this character is the exact sequence transmitted to an output device by the system to accomplish the movement to the beginning of the line.

2.2.2.24 character: A sequence of one or more bytes representing a single graphic symbol.

NOTE: This term corresponds in the C Standard {7} to the term multibyte character, noting that a single-byte character is a special case of multibyte character. Unlike the usage in the C Standard {7}, character here has no necessary relationship with storage space, and byte is used when storage space is discussed.

[POSIX.1 {8}]

(See 2.4 for a further explanation of the graphical representations of characters, or ``glyphs,'' versus character encodings.)

2.2.2.25 character class: A named set of characters sharing an attribute associated with the name of the class. The classes and the characters that they contain are dependent on the value of the LC_CTYPE category in the current locale; see 2.5.

2.2.2.26 character special file: A file that refers to a device. One specific type of character special file is a terminal device file, whose access is defined in POSIX.1 {8} section 7.1. Other character special files have no structure defined by this standard, and their use is unspecified by this standard. [POSIX.1 {8}]

2.2.2.27 circumflex: The character ``^''.
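The distinction drawn in 2.2.2.22 and 2.2.2.24 between bytes (storage) and characters (graphic symbols) can be observed from the shell. The following is a minimal sketch, assuming that the wc utility supports both the -c option (count bytes) and the -m option (count characters); the function names count_bytes and count_chars are hypothetical.

```shell
# Hypothetical helpers contrasting bytes with characters.
# Assumption: wc supports -c (bytes) and -m (characters).

# Count the bytes used to store a string.
count_bytes() {
    printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

# Count the characters in a string, as determined by the LC_CTYPE
# category of the current locale.
count_chars() {
    printf '%s' "$1" | wc -m | tr -d '[:space:]'
}

word=abc
printf 'bytes=%s characters=%s\n' "$(count_bytes "$word")" "$(count_chars "$word")"
```

For a string drawn from the portable character set the two counts agree. In a locale whose encoding represents some characters with more than one byte, the byte count for such a string exceeds the character count.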
2.2.2.28 collating element: The smallest entity used to determine the logical ordering of strings. See collation sequence (2.2.2.30). A collating element shall consist of either a single character, or two or more characters collating as a single entity. The value of the LC_COLLATE category in the current locale determines the current set of collating elements.

2.2.2.29 collation: The logical ordering of strings according to defined precedence rules. These rules identify a collation sequence between the collating elements, and such additional rules that can be used to order strings consisting of multiple collating elements.

2.2.2.30 collation sequence: The relative order of collating elements as determined by the setting of the LC_COLLATE category in the current locale.

The character order, as defined for the LC_COLLATE category in the current locale (see 2.5.2.2), defines the relative order of all collating elements, such that each element occupies a unique position in the order. In addition, one or more collation weights can be assigned for each collating element; these weights are used to determine the relative order of strings in, e.g., the sort utility.

Multilevel sorting is accomplished by assigning elements one or more collation weights, up to the limit {COLL_WEIGHTS_MAX}. On each level, elements may be given the same weight (at the primary level, called an equivalence class; see 2.2.2.47) or be omitted from the sequence. Strings that collate equal using the first assigned weight (primary ordering), are then compared using the next assigned weight (secondary ordering), and so on.

2.2.2.31 column position: A unit of horizontal measure related to characters in a line.
It is assumed that each character in a character set has an intrinsic column width independent of any output device. Each printable character in the portable character set has a column width of one. The standard utilities, when used as described in this standard, assume that all characters have integral column widths. The column width of a character is not necessarily related to the internal representation of the character (numbers of bits or octets).

The column position of a character in a line is defined as one plus the sum of the column widths of the preceding characters in the line. Column positions are numbered starting from 1.

2.2.2.32 command: A directive to the shell to perform a particular task; see 3.9.

2.2.2.33 current working directory: See working directory in 2.2.2.159.

2.2.2.34 command language interpreter: See 2.2.2.133.

2.2.2.35 directory: A file that contains directory entries. No two directory entries in the same directory shall have the same name. [POSIX.1 {8}]

2.2.2.36 directory entry [link]: An object that associates a filename with a file. Several directory entries can associate names with the same file. [POSIX.1 {8}]

2.2.2.37 dollar-sign: The character ``$''.

This standard permits the substitution of the ``currency symbol'' graphic defined in ISO/IEC 646 {1} for this symbol when the character set being used has substituted that graphic for the graphic $. The graphic symbol $ is always used in this standard, but not in any monetary sense.

2.2.2.38 dot: The filename consisting of a single dot character (.). See pathname resolution in 2.2.2.104. [POSIX.1 {8}]

In the context of shell special built-in utilities, see 3.14.4.

2.2.2.39 dot-dot: The filename consisting solely of two dot characters (..).
See pathname resolution in 2.2.2.104. [POSIX.1 {8}]

2.2.2.40 double-quote: The character ``"'', also known as quotation-mark.

2.2.2.41 effective group ID: An attribute of a process that is used in determining various permissions, including file access permissions, described in 2.2.2.55. See group ID. This value is subject to change during the process lifetime, as described in POSIX.1 {8} 3.1.2 (exec) and 4.2.2 [setgid()]. [POSIX.1 {8}]

2.2.2.42 effective user ID: An attribute of a process that is used in determining various permissions, including file access permissions. See user ID. This value is subject to change during the process lifetime, as described in POSIX.1 {8} 3.1.2 (exec) and 4.2.2 [setuid()]. [POSIX.1 {8}]

2.2.2.43 empty directory: A directory that contains, at most, directory entries for dot and dot-dot. [POSIX.1 {8}]

2.2.2.44 empty line: A line consisting of only a <newline> character. See also blank line (2.2.2.17).

2.2.2.45 empty string [null string]: A character array whose first element is a null character. [POSIX.1 {8}]

2.2.2.46 Epoch: The time 0 hours, 0 minutes, 0 seconds, January 1, 1970, Coordinated Universal Time. See seconds since the Epoch. [POSIX.1 {8}]

2.2.2.47 equivalence class: A set of collating elements with the same primary collation weight.

Elements in an equivalence class are typically elements that naturally group together, such as all accented letters based on the same base letter. The collation order of elements within an equivalence class is determined by the weights assigned on any subsequent levels after the primary weight.
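The relationship between collation weights (2.2.2.30) and equivalence classes (2.2.2.47) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight table WEIGHTS, the two-level scheme, and the helper names are hypothetical and are not taken from any locale defined by this standard.

```python
# Hypothetical two-level weight table (an assumption for illustration only).
# Each collating element maps to (primary weight, secondary weight).
WEIGHTS = {
    "a": (1, 1),
    "A": (1, 2),  # same primary weight as "a": same equivalence class
    "b": (2, 1),
    "c": (3, 1),
}


def collation_key(string):
    """Compare all primary weights first, then all secondary weights."""
    primary = tuple(WEIGHTS[element][0] for element in string)
    secondary = tuple(WEIGHTS[element][1] for element in string)
    return (primary, secondary)


def equivalence_class(element):
    """Return every element that shares the primary weight of `element`."""
    target = WEIGHTS[element][0]
    return sorted(e for e, weight in WEIGHTS.items() if weight[0] == target)


words = ["ba", "Ab", "ab", "ca"]
print(sorted(words, key=collation_key))
print(equivalence_class("a"))
```

In this sketch "ab" and "Ab" collate equal at the primary level, so the secondary weights decide their relative order.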
2.2.2.48 executable file: A regular file acceptable as a new process image file by the equivalent of the POSIX.1 {8} exec family of functions, and thus usable as one form of a utility. See exec in POSIX.1 {8} 3.1.2. The standard utilities described as compilers can produce executable files, but other unspecified methods of producing executable files may also be provided. The internal format of an executable file is unspecified, but a conforming application shall not assume an executable file is a text file.

2.2.2.49 execute: To perform the actions described in 3.9.1.1. See also invoke (2.2.2.79).

2.2.2.50 extended regular expression: A pattern (sequence of characters or symbols) constructed according to the rules defined in 2.8.4.

2.2.2.51 extended security controls: A concept of the underlying system, as follows. [POSIX.1 {8}]

The access control (see file access permissions) and privilege (see appropriate privileges in 2.2.2.6) mechanisms have been defined to allow implementation-defined extended security controls. These permit an implementation to provide security mechanisms to implement different security policies than described in POSIX.1 {8}. These mechanisms shall not alter or override the defined semantics of any of the functions in POSIX.1 {8}.

2.2.2.52 feature test macro: A #defined symbol used to determine whether a particular set of features will be included from a header. See POSIX.1 {8} 2.7.1. [POSIX.1 {8}]

2.2.2.53 FIFO special file [FIFO]: A type of file with the property that data written to such a file is read on a first-in-first-out basis. Other characteristics of FIFOs are described in POSIX.1 {8} 5.3.1 [open()], 6.4.1 [read()], 6.4.2 [write()], and 6.5.3 [lseek()].
[POSIX.1 {8}]

2.2.2.54 file: An object that can be written to, or read from, or both. A file has certain attributes, including access permissions and type. File types include regular file, character special file, block special file, FIFO special file, and directory. Other types of files may be defined by the implementation. [POSIX.1 {8}]

2.2.2.55 file access permissions: A concept of the underlying system, as follows. [POSIX.1 {8}]

The standard file access control mechanism uses the file permission bits, as described below. These bits are set at file creation by open(), creat(), mkdir(), and mkfifo() and are changed by chmod(). These bits are read by stat() or fstat().

Implementations may provide additional or alternate file access control mechanisms, or both. An additional access control mechanism shall only further restrict the access permissions defined by the file permission bits. An alternate access control mechanism shall:

    (1) Specify file permission bits for the file owner class, file group class, and file other class of the file, corresponding to the access permissions, to be returned by stat() or fstat().

    (2) Be enabled only by explicit user action, on a per-file basis by the file owner or a user with the appropriate privilege.

    (3) Be disabled for a file after the file permission bits are changed for that file with chmod(). The disabling of the alternate mechanism need not disable any additional mechanisms defined by an implementation.
Whenever a process requests file access permission for read, write, or execute/search, if no additional mechanism denies access, access is determined as follows:

    (1) If a process has the appropriate privilege:

        (a) If read, write, or directory search permission is requested, access is granted.

        (b) If execute permission is requested, access is granted if execute permission is granted to at least one user by the file permission bits or by an alternate access control mechanism; otherwise, access is denied.

    (2) Otherwise:

        (a) The file permission bits of a file contain read, write, and execute/search permissions for the file owner class, file group class, and file other class.

        (b) Access is granted if an alternate access control mechanism is not enabled and the requested access permission bit is set for the class (file owner class, file group class, or file other class) to which the process belongs, or if an alternate access control mechanism is enabled and it allows the requested access; otherwise, access is denied.

2.2.2.56 file descriptor: A per-process unique, nonnegative integer used to identify an open file for the purpose of file access. [POSIX.1 {8}]

2.2.2.57 file group class: The property of a file indicating access permissions for a process related to the process's group identification. A process is in the file group class of a file if the process is not in the file owner class and if the effective group ID or one of the supplementary group IDs of the process matches the group ID associated with the file. Other members of the class may be implementation defined. [POSIX.1 {8}]

2.2.2.58 file hierarchy: A concept of the underlying system, as follows.
[POSIX.1 {8}]

Files in the system are organized in a hierarchical structure in which all of the nonterminal nodes are directories and all of the terminal nodes are any other type of file. Because multiple directory entries may refer to the same file, the hierarchy is properly described as a ``directed graph.''

2.2.2.59 file mode: An object containing the file permission bits and other characteristics of a file, as described in POSIX.1 {8} 5.6.1. [POSIX.1 {8}]

2.2.2.60 file mode bits: A file's file permission bits, set-user-ID-on-execution bit (S_ISUID), and set-group-ID-on-execution bit (S_ISGID) (see POSIX.1 {8} 5.6.1.2).

2.2.2.61 filename: A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the slash character and the null character. The filenames dot and dot-dot have special meaning; see pathname resolution in 2.2.2.104. A filename is sometimes referred to as a pathname component. [POSIX.1 {8}]

2.2.2.62 filename portability: A concept of the underlying system, as follows. [POSIX.1 {8}]

Filenames should be constructed from the portable filename character set because the use of other characters can be confusing or ambiguous in certain contexts.

2.2.2.63 file offset: The byte position in the file where the next I/O operation begins. Each open file description associated with a regular file, block special file, or directory has a file offset. A character special file that does not refer to a terminal device may have a file offset. There is no file offset specified for a pipe or FIFO. [POSIX.1 {8}]

2.2.2.64 file other class: The property of a file indicating access permissions for a process related to the process's user and group identification.
A process is in the file other class of a file if the process is not in the file owner class or file group class. [POSIX.1 {8}]

2.2.2.65 file owner class: The property of a file indicating access permissions for a process related to the process's user identification. A process is in the file owner class of a file if the effective user ID of the process matches the user ID of the file. [POSIX.1 {8}]

2.2.2.66 file permission bits: Information about a file that is used, along with other information, to determine if a process has read, write, or execute/search permission to a file. The bits are divided into three parts: owner, group, and other. Each part is used with the corresponding file class of processes. These bits are contained in the file mode, as described in POSIX.1 {8} 5.6.1. The detailed usage of the file permission bits in access decisions is described in file access permissions in 2.2.2.55. [POSIX.1 {8}]

2.2.2.67 file serial number: A per-file-system unique identifier for a file. File serial numbers are unique throughout a file system. [POSIX.1 {8}]

2.2.2.68 file system: A collection of files and certain of their attributes. It provides a name space for file serial numbers referring to those files. [POSIX.1 {8}]

2.2.2.69 file times update: A concept of the underlying system, as follows. [POSIX.1 {8}]

Each file has three distinct associated time values: st_atime, st_mtime, and st_ctime. The st_atime field is associated with the times that the file data is accessed; st_mtime is associated with the times that the file data is modified; and st_ctime is associated with the times that file status is changed. These values are returned in the file characteristics structure, as described in POSIX.1 {8} 5.6.1.
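As a brief, non-normative illustration of the three time values, the following Python sketch reads them through os.stat(), which on POSIX systems is a wrapper around stat(). The helper name file_times and the use of a temporary file are assumptions made for the example only.

```python
import os
import tempfile


def file_times(path):
    """Return the three time values reported by stat() for `path`."""
    status = os.stat(path)
    return {
        "st_atime": status.st_atime,  # file data last accessed
        "st_mtime": status.st_mtime,  # file data last modified
        "st_ctime": status.st_ctime,  # file status last changed (POSIX systems)
    }


# Create a scratch file so the example does not depend on existing paths.
with tempfile.NamedTemporaryFile() as scratch:
    scratch.write(b"hello\n")
    scratch.flush()
    for name, value in file_times(scratch.name).items():
        print(name, value)
```

The values are printed as seconds since the Epoch; when each field is actually updated is governed by the rules in the remainder of this definition.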
Any function in this standard that is required to read or write file data or change the file status indicates which of the appropriate time-related fields are to be ``marked for update.'' If an implementation of such a function marks for update a time-related field not specified by this standard, this shall be documented, except that any changes caused by pathname resolution need not be documented. For the other functions in this standard (those that are not explicitly required to read or write file data or change file status, but that in some implementations happen to do so), the effect is unspecified.

An implementation may update fields that are marked for update immediately, or it may update such fields periodically. When the fields are updated, they are set to the current time and the update marks are cleared. All fields that are marked for update shall be updated when the file is no longer open by any process, or when a stat() or fstat() is performed on the file. Other times at which updates are done are unspecified. Updates are not done for files on read-only file systems.

2.2.2.70 file type: See file in 2.2.2.54.

2.2.2.71 filter: A command whose operation consists of reading data from standard input or a list of input files and writing data to standard output. Typically, its function is to perform some transformation on the data stream.

2.2.2.72 foreground process: A process that is a member of a foreground process group. [POSIX.1 {8}]

2.2.2.73 foreground process group: A process group whose member processes have certain privileges, denied to processes in background process groups, when accessing their controlling terminal.
Each session that has established a connection with a controlling terminal has exactly one process group of the session as the foreground process group of that controlling terminal. See POSIX.1 {8} 7.1.1.4. [POSIX.1 {8}]

2.2.2.74 <form-feed>: A character that in the output stream shall indicate that printing should start on the next page of an output device. The <form-feed> shall be the character designated by '\f' in the C language binding. If <form-feed> is not the first character of an output line, the result is unspecified.

It is unspecified whether this character is the exact sequence transmitted to an output device by the system to accomplish the movement to the next page.

2.2.2.75 group ID: A nonnegative integer, which can be contained in an object of type gid_t, that is used to identify a group of system users. Each system user is a member of at least one group. When the identity of a group is associated with a process, a group ID value is referred to as a real group ID, an effective group ID, one of the (optional) supplementary group IDs, or an (optional) saved set-group-ID. [POSIX.1 {8}]

2.2.2.76 hard link: The relationship between two directory entries that represent the same file; the result of an execution of the ln utility or the POSIX.1 {8} link() function.

2.2.2.77 home directory: The current directory associated with a user at the time of login.

2.2.2.78 incomplete line: A sequence of text consisting of one or more non-<newline> characters at the end of the file.

2.2.2.79 invoke: To perform the actions described in 3.9.1.1, except that searching for shell functions and special built-ins is suppressed. See also execute (2.2.2.49).
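To make the hard link definition (2.2.2.76) concrete, the sketch below creates a second directory entry for a file and checks that both names refer to the same file. It is a minimal illustration, assuming a file system that supports hard links; os.link() is Python's wrapper for the link() function, and the helper name demo_hard_link and the file names are hypothetical.

```python
import os
import tempfile


def demo_hard_link():
    """Create a file, link a second name to it, and report (link count, same file?)."""
    with tempfile.TemporaryDirectory() as directory:
        first = os.path.join(directory, "first")
        second = os.path.join(directory, "second")
        with open(first, "w") as handle:
            handle.write("data\n")
        os.link(first, second)  # add a second directory entry for the same file
        a, b = os.stat(first), os.stat(second)
        # Same file system and same file serial number means the same file.
        same_file = (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
        return a.st_nlink, same_file


print(demo_hard_link())
```

On a typical POSIX file system the link count is expected to be 2 and both names should report the same file serial number.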
2.2.2.80 job control: A facility that allows users to selectively stop (suspend) the execution of processes and continue (resume) their execution at a later point. The user typically employs this facility via the interactive interface jointly supplied by the terminal I/O driver and a command interpreter. POSIX.1 {8} conforming implementations may optionally support job control facilities; the presence of this option is indicated to the application at compile time or run time by the definition of the {_POSIX_JOB_CONTROL} symbol; see POSIX.1 {8} 2.9. [POSIX.1 {8}]

2.2.2.81 line: A sequence of text consisting of zero or more non-<newline> characters plus a terminating <newline> character.

2.2.2.82 link: See directory entry in 2.2.2.36.

2.2.2.83 link count: The number of directory entries that refer to a particular file. [POSIX.1 {8}]

2.2.2.84 locale: The definition of the subset of a user's environment that depends on language and cultural conventions; see 2.5.

2.2.2.85 login: The unspecified activity by which a user gains access to the system. Each login shall be associated with exactly one login name. [POSIX.1 {8}]

2.2.2.86 login name: A user name that is associated with a login. [POSIX.1 {8}]

2.2.2.87 mode: A collection of attributes that specifies a file's type and its access permissions. See file access permissions in 2.2.2.55. [POSIX.1 {8}]

2.2.2.88 multicharacter collating element: A sequence of two or more characters that collate as an entity. For example, in some coded character sets, an accented character is represented by a (nonspacing) accent, followed by the letter.
Another example is the Spanish elements ``ch'' and ``ll.''

2.2.2.89 negative response: An input string that matches one of the responses acceptable to the LC_MESSAGES category keyword noexpr, matching an extended regular expression in the current locale. See 2.5.

2.2.2.90 <newline>: A character that in the output stream shall indicate that printing should start at the beginning of the next line. The <newline> shall be the character designated by '\n' in the C language binding. It is unspecified whether this character is the exact sequence transmitted to an output device by the system to accomplish the movement to the next line.

2.2.2.91 NUL: A character with all bits set to zero.

2.2.2.92 null string: See empty string in 2.2.2.45.

2.2.2.93 number-sign: The character ``#''.

This standard permits the substitution of the ``pound sign'' graphic defined in ISO/IEC 646 {1} for this symbol when the character set being used has substituted that graphic for the graphic #. The graphic symbol # is always used in this standard.

2.2.2.94 object file: A regular file containing the output of a compiler, formatted as input to a linkage editor for linking with other object files into an executable form. The methods of linking are unspecified and may involve the dynamic linking of objects at run-time. The internal format of an object file is unspecified, but a conforming application shall not assume an object file is a text file.

2.2.2.95 open file: A file that is currently associated with a file descriptor. [POSIX.1 {8}]

2.2.2.96 operand: An argument to a command that is generally used as an object supplying information to a utility necessary to complete its processing. Operands generally follow the options in a command line. See 2.10.1.
2.2.2.97 option: An argument to a command that is generally used to specify changes in the utility's default behavior; see 2.10.1.

2.2.2.98 option-argument: A parameter that follows certain options. In some cases an option-argument is included within the same argument string as the option; in most cases it is the next argument. See 2.10.1.

2.2.2.99 parent directory:

    (1) When discussing a given directory, the directory that both contains a directory entry for the given directory and is represented by the pathname dot-dot in the given directory.

    (2) When discussing other types of files, a directory containing a directory entry for the file under discussion.

This concept does not apply to dot and dot-dot. [POSIX.1 {8}]

2.2.2.100 parent process: See process in 2.2.2.114. [POSIX.1 {8}]

2.2.2.101 parent process ID: An attribute of a new process after it is created by a currently active process. The parent process ID of a process is the process ID of its creator, for the lifetime of the creator. After the creator's lifetime has ended, the parent process ID is the process ID of an implementation-defined system process. [POSIX.1 {8}]

2.2.2.102 pathname: A string that is used to identify a file. A pathname consists of, at most, {PATH_MAX} bytes, including the terminating null character. It has an optional beginning slash, followed by zero or more filenames separated by slashes. If the pathname refers to a directory, it may also have one or more trailing slashes. Multiple successive slashes are considered to be the same as one slash. A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner, although more than two leading slashes shall be treated as a single slash.
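The slash rules above can be sketched as a purely lexical transformation. The following Python fragment is a minimal illustration, not part of this standard: the function name collapse_slashes is hypothetical, and the choice to preserve exactly two leading slashes unchanged is one possible reading of ``implementation-defined.''

```python
import re


def collapse_slashes(pathname):
    """Treat runs of slashes as a single slash, per the pathname definition."""
    # Exactly two leading slashes may be interpreted in an implementation-defined
    # manner, so this sketch leaves them untouched (an assumption).
    if pathname.startswith("//") and not pathname.startswith("///"):
        return "//" + re.sub(r"/+", "/", pathname[2:])
    # Otherwise every run of slashes, including three or more leading
    # slashes, is equivalent to one slash.
    return re.sub(r"/+", "/", pathname)


for example in ("/usr//lib///libc.a", "//host/share", "///tmp//x", "dir//sub/"):
    print(example, "->", collapse_slashes(example))
```

Trailing slashes and the special filenames dot and dot-dot are deliberately left alone here, since their meaning depends on pathname resolution rather than on the spelling of the pathname.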
The interpretation of the pathname is described in pathname resolution in 2.2.2.104. [POSIX.1 {8}]

2.2.2.103 pathname component: See filename in 2.2.2.61. [POSIX.1 {8}]

2.2.2.104 pathname resolution: A concept of the underlying system, as follows. [POSIX.1 {8}]

Pathname resolution is performed for a process to resolve a pathname to a particular file in a file hierarchy. There may be multiple pathnames that resolve to the same file.

Each filename in the pathname is located in the directory specified by its predecessor (for example, in the pathname fragment ``a/b'', file ``b'' is located in directory ``a''). Pathname resolution fails if this cannot be accomplished. If the pathname begins with a slash, the predecessor of the first filename in the pathname is taken to be the root directory of the process (such pathnames are referred to as ``absolute pathnames''). If the pathname does not begin with a slash, the predecessor of the first filename of the pathname is taken to be the current working directory of the process (such pathnames are referred to as ``relative pathnames'').

The interpretation of a pathname component is dependent on the values of {NAME_MAX} and {_POSIX_NO_TRUNC} associated with the path prefix of that component. If any pathname component is longer than {NAME_MAX}, and {_POSIX_NO_TRUNC} is in effect for the path prefix of that component [see pathconf() in POSIX.1 {8} 5.7.1], the implementation shall consider this an error condition. Otherwise, the implementation shall use the first {NAME_MAX} bytes of the pathname component.

The special filename dot refers to the directory specified by its predecessor. The special filename dot-dot refers to the parent directory of its predecessor directory.
As a special case, in the root directory, dot-dot may refer to the root directory itself.

A pathname consisting of a single slash resolves to the root directory of the process. A null pathname is invalid.

2.2.2.105 path prefix: A pathname, with an optional ending slash, that refers to a directory. [POSIX.1 {8}]

2.2.2.106 pattern: A sequence of characters used either with regular expression notation (see 2.8) or for pathname expansion (see 3.6.6), as a means of selecting various character strings or pathnames, respectively. The syntaxes of the two patterns are similar, but not identical; this standard always indicates the type of pattern being referred to in the immediate context of the use of the term.

2.2.2.107 period: The character ``.''. The term period is contrasted against dot (2.2.2.38), which is used to describe a specific directory entry.

2.2.2.108 permissions: See file access permissions in 2.2.2.55.

2.2.2.109 pipe: An object accessed by one of the pair of file descriptors created by the POSIX.1 {8} pipe() function. Once created, the file descriptors can be used to manipulate it, and it behaves identically to a FIFO special file when accessed in this way. It has no name in the file hierarchy. [POSIX.1 {8}]

2.2.2.110 portable character set: The set of characters described in 2.4 that is supported on all conforming systems. This term is contrasted against the smaller portable filename character set; see 2.2.2.111.

2.2.2.111 portable filename character set: The set of characters from which portable filenames are constructed.
For a filename to be portable across conforming implementations of this standard, it shall consist only of the following characters:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 . _ -

The last three characters are the period, underscore, and hyphen characters, respectively. The hyphen shall not be used as the first character of a portable filename. Upper- and lowercase letters shall retain their unique identities between conforming implementations. In the case of a portable pathname, the slash character may also be used. [POSIX.1 {8}]

2.2.2.112 printable character: One of the characters included in the print character classification of the LC_CTYPE category in the current locale; see 2.5.2.1.

2.2.2.113 privilege: See appropriate privileges in 2.2.2.6. [POSIX.1 {8}]

2.2.2.114 process: An address space and single thread of control that executes within that address space, and its required system resources. A process is created by another process issuing the POSIX.1 {8} fork() function. The process that issues fork() is known as the parent process, and the new process created by the fork() is known as the child process. [POSIX.1 {8}]

The attributes of processes required by POSIX.2 form a subset of those in POSIX.1 {8}; see 2.9.1.

2.2.2.115 process group: A collection of processes that permits the signaling of related processes. Each process in the system is a member of a process group that is identified by a process group ID. A newly created process joins the process group of its creator. [POSIX.1 {8}]

2.2.2.116 process group ID: The unique identifier representing a process group during its lifetime. A process group ID is a positive integer that can be contained in a pid_t. It shall not be reused by the system until the process group lifetime ends. [POSIX.1 {8}]
2.2.2.117 process group leader: A process whose process ID is the same as its process group ID. [POSIX.1 {8}]

2.2.2.118 process ID: The unique identifier representing a process. A process ID is a positive integer that can be contained in a pid_t. A process ID shall not be reused by the system until the process lifetime ends. In addition, if there exists a process group whose process group ID is equal to that process ID, the process ID shall not be reused by the system until the process group lifetime ends. A process that is not a system process shall not have a process ID of 1. [POSIX.1 {8}]

2.2.2.119 program: A prepared sequence of instructions to the system to accomplish a defined task. The term program in POSIX.2 encompasses applications written in the Shell Command Language, complex utility input languages (for example, awk, lex, sed, etc.), and high-level languages.

2.2.2.120 read-only file system: A file system that has implementation-defined characteristics restricting modifications. [POSIX.1 {8}]

2.2.2.121 real group ID: The attribute of a process that, at the time of process creation, identifies the group of the user who created the process. See group ID in 2.2.2.75. This value is subject to change during the process lifetime, as described in POSIX.1 {8} 4.2.2 [setgid()]. [POSIX.1 {8}]

2.2.2.122 real user ID: The attribute of a process that, at the time of process creation, identifies the user who created the process. See user ID in 2.2.2.154. This value is subject to change during the process lifetime, as described in POSIX.1 {8} 4.2.2 [setuid()]. [POSIX.1 {8}]

2.2.2.123 regular expression: A pattern (sequence of characters or symbols) constructed according to the rules defined in 2.8.
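As an illustration of a pattern in use, the portable filename character set of 2.2.2.111 can be expressed as a pattern and checked mechanically. The sketch below is written with Python's re module, whose syntax is not the regular expression notation of 2.8; the function name is_portable_filename is hypothetical, and the {NAME_MAX} length limit of 2.2.2.61 is deliberately not checked.

```python
import re

# Letters, digits, period, underscore, and hyphen; the hyphen is excluded
# from the first position, as required for a portable filename.
_PORTABLE_FILENAME = re.compile(r"[A-Za-z0-9._][A-Za-z0-9._-]*")


def is_portable_filename(name):
    """Return True if `name` uses only the portable filename character set."""
    return _PORTABLE_FILENAME.fullmatch(name) is not None


for candidate in ("Makefile", "a-b_c.1", "-rf", "two words", "dir/file"):
    print(repr(candidate), is_portable_filename(candidate))
```

A slash fails this check because the rule applies to a single filename; a portable pathname would be tested component by component.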
2.2.2.124 regular file: A file that is a randomly accessible sequence of bytes, with no further structure imposed by the system. [POSIX.1 {8}]

2.2.2.125 relative pathname: See pathname resolution in 2.2.2.104. [POSIX.1 {8}]

2.2.2.126 root directory: A directory, associated with a process, that is used in pathname resolution for pathnames that begin with a slash. [POSIX.1 {8}]

2.2.2.127 saved set-group-ID: An attribute of a process that allows some flexibility in the assignment of the effective group ID attribute, when the saved set-user-ID option is implemented, as described in POSIX.1 {8} 3.1.2 (exec) and 4.2.2 [setgid()]. [POSIX.1 {8}]

2.2.2.128 saved set-user-ID: An attribute of a process that allows some flexibility in the assignment of the effective user ID attribute, when the saved set-user-ID option is implemented, as described in POSIX.1 {8} 3.1.2 and 4.2.2 [setuid()]. [POSIX.1 {8}]

2.2.2.129 seconds since the Epoch: A value to be interpreted as the number of seconds between a specified time and the Epoch. A Coordinated Universal Time name [specified in terms of seconds (tm_sec), minutes (tm_min), hours (tm_hour), days since January 1 of the year (tm_yday), and calendar year minus 1900 (tm_year)] is related to a time represented as seconds since the Epoch, according to the expression below. If the year < 1970 or the value is negative, the relationship is undefined.
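The expression given in this definition can be checked mechanically. The following Python sketch is a non-normative illustration: the function names are hypothetical, the division is taken to be integer division as in C, and the conversion from Python's struct_time (full calendar year, day of year counted from 1) to the tm_year and tm_yday conventions used here is an assumption of the example.

```python
import time


def seconds_since_epoch(tm_sec, tm_min, tm_hour, tm_yday, tm_year):
    """Evaluate the expression from the definition, using integer division."""
    return (tm_sec + tm_min * 60 + tm_hour * 3600 + tm_yday * 86400
            + (tm_year - 70) * 31536000 + ((tm_year - 69) // 4) * 86400)


def from_struct_time(utc):
    """Adapt a Python struct_time to the field conventions used in the text."""
    return seconds_since_epoch(
        utc.tm_sec, utc.tm_min, utc.tm_hour,
        utc.tm_yday - 1,      # Python counts days of the year from 1
        utc.tm_year - 1900,   # Python stores the full calendar year
    )


# Round trip: break a value down into UTC fields, then rebuild it.
for value in (0, 86399, 683683200, 1000000000):
    print(value, from_struct_time(time.gmtime(value)))
```

The term ((tm_year-69)/4)*86400 adds one day for each leap year completed since 1970. It takes no account of century years that are not leap years, so this check is only meaningful for years before 2100.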
If the year >= 1970 and the value is nonnegative, the value is related to a Coordinated Universal Time name according to the expression:

    tm_sec + tm_min*60 + tm_hour*3600 + tm_yday*86400 +
    (tm_year-70)*31536000 + ((tm_year-69)/4)*86400

[POSIX.1 {8}]

2.2.2.130 session: A collection of process groups established for job control purposes. Each process group is a member of a session. A process is considered to be a member of the session of which its process group is a member. A newly created process joins the session of its creator. A process can alter its session membership (see POSIX.1 {8} 4.3.2 [setsid()]). Implementations that support the POSIX.1 {8} setpgid() function (see POSIX.1 {8} 4.3.3) can have multiple process groups in the same session. [POSIX.1 {8}]

2.2.2.131 session leader: A process that has created a session; see POSIX.1 {8} 4.3.2 [setsid()]. [POSIX.1 {8}]

2.2.2.132 session lifetime: The period between when a session is created and the end of the lifetime of all the process groups that remain as members of the session. [POSIX.1 {8}]

2.2.2.133 shell: A program that interprets sequences of text input as commands. It may operate on an input stream or it may interactively prompt and read commands from a terminal.

2.2.2.134 Shell, The: The Shell Command Language Interpreter (see 4.56), a specific instance of a shell.

2.2.2.135 shell script: A file containing shell commands. If the file is made executable, it can be executed by specifying its name as a simple command (see the description of simple command in 3.9.1). Execution of a shell script causes a shell to execute the commands within the script.
Alternatively, a shell can be requested to execute the commands in a shell script by specifying the name of the shell script as the operand to the sh utility.

2.2.2.136 signal: A mechanism by which a process may be notified of, or affected by, an event occurring in the system. Examples of such events include hardware exceptions and specific actions by processes. The term signal is also used to refer to the event itself. [POSIX.1 {8}]

2.2.2.137 single-quote: The character ``''', also known as apostrophe.

2.2.2.138 slash: The character ``/'', also known as solidus.

2.2.2.139 source code: When dealing with the Shell Command Language, source code is input to the command language interpreter. The term shell script is synonymous with this meaning.

When dealing with the C Language Bindings Option, source code is input to a C compiler conforming to the C Standard {7}.

When dealing with another ISO/IEC conforming language, source code is input to a compiler conforming to that ISO/IEC standard.

Source code also refers to the input statements prepared for the following standard utilities: awk, bc, ed, lex, localedef, make, sed, and yacc.

Source code can also refer to a collection of sources meeting any or all of these meanings.

2.2.2.140 <space>: The character defined in 2.4 as <space>.

The <space> character is a member of the space character class of the current locale, but represents the single character, and not all of the possible members of the class. (See 2.2.2.158.)

2.2.2.141 standard error: An output stream usually intended to be used for diagnostic messages.

2.2.2.142 standard input: An input stream usually intended to be used for primary data input.

2.2.2.143 standard output: An output stream usually intended to be used for primary data output.
2.2.2.144 standard utilities: The utilities defined by this standard in Sections 4, 5, and 6, Annex A, and Annex C, and in similar sections of utility definitions introduced in future revisions of, and supplements to, this standard.

2.2.2.145 stream: An ordered sequence of characters, as described by the C Standard {7}.

2.2.2.146 supplementary group ID: An attribute of a process used in determining file access permissions. A process has up to {NGROUPS_MAX} supplementary group IDs in addition to the effective group ID. The supplementary group IDs of a process are set to the supplementary group IDs of the parent process when the process is created. Whether a process's effective group ID is included in or omitted from its list of supplementary group IDs is unspecified. [POSIX.1 {8}]

2.2.2.147 system: An implementation of this standard.

2.2.2.148 <tab>: The horizontal tab character.

2.2.2.149 terminal [terminal device]: A character special file that obeys the specifications of the POSIX.1 {8} General Terminal Interface. [POSIX.1 {8}]

2.2.2.150 text column: A roughly rectangular block of characters capable of being laid out side-by-side next to other text columns on an output page or terminal screen. The widths of text columns are measured in column positions.

2.2.2.151 text file: A file that contains characters organized into one or more lines. The lines shall not contain NUL characters and none shall exceed {LINE_MAX} bytes in length, including the <newline>. Although POSIX.1 {8} does not distinguish between text files and binary files (see the C Standard {7}), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify text files in their Standard Input or Input Files subclauses.
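The text file definition above can be turned into a rough check, sketched here under stated assumptions. The function name is_text_file is hypothetical, the test for a trailing <newline> assumes an ASCII-based coded character set (byte value 0a hexadecimal), and the line-length check counts bytes by running awk in the POSIX Locale.

```sh
# Hypothetical helper: succeed if the file named by $1 has one or more
# lines, no NUL characters, and no line longer than {LINE_MAX} bytes
# (counting the <newline>).
is_text_file() {
    file=$1
    max=$(getconf LINE_MAX) || return 2
    total=$(( $(wc -c < "$file") ))
    # One or more lines: the file cannot be empty ...
    [ "$total" -gt 0 ] || return 1
    # ... and its last byte has to be a <newline> (0a is an ASCII assumption).
    last=$(tail -c 1 "$file" | od -An -tx1 | tr -d ' \n')
    [ "$last" = "0a" ] || return 1
    # No NUL characters: deleting NULs must not change the byte count.
    kept=$(( $(tr -d '\000' < "$file" | wc -c) ))
    [ "$kept" -eq "$total" ] || return 1
    # No line may exceed {LINE_MAX} bytes, including its <newline>.
    LC_ALL=POSIX awk -v max="$max" \
        'length($0) + 1 > max { bad = 1; exit } END { exit bad }' "$file"
}
```

A file holding a single <newline> passes this check, while an empty file or a file whose last line is incomplete does not.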
2.2.2.152 tilde: The character ``~''.

2.2.2.153 user database: See Section 9 in POSIX.1 {8}.

2.2.2.154 user ID: A nonnegative integer, which can be contained in an object of type uid_t, that is used to identify a system user. When the identity of a user is associated with a process, a user ID value is referred to as a real user ID, an effective user ID, or an (optional) saved set-user-ID. [POSIX.1 {8}]

2.2.2.155 user name: A string that is used to identify a user, as described in POSIX.1 {8} 9.1. [POSIX.1 {8}]

2.2.2.156 utility: A program that can be called by name from a shell to perform a specific task, or related set of tasks. This program shall either be an executable file, such as might be produced by a compiler/linker system from computer source code, or a file of shell source code, directly interpreted by the shell. The program may have been produced by the user, provided by the implementor of this standard, or acquired from an independent distributor. The term utility does not apply to the special built-in utilities provided as part of the shell command language; see 3.14. The system may implement certain utilities as shell functions (see 3.9.5) or built-ins (see 2.3), but only an application that is aware of the command search order described in 3.9.1.1 or of performance characteristics can discern differences between the behavior of such a function or built-in and that of a true executable file.

2.2.2.157 <vertical-tab>: The vertical tab character.

2.2.2.158 white space: A sequence of one or more characters that belong to the space character class as defined via the LC_CTYPE category in the current locale. In the POSIX Locale, white space consists of one or more <blank>s (<space>s and <tab>s), <newline>s, <carriage-return>s, <form-feed>s, and <vertical-tab>s.
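The membership of the space and blank classes in the POSIX Locale can be observed with the tr utility, as in the sketch below. The function name count_class is hypothetical, and the example assumes that the implementation's tr and printf accept the class names and escape sequences shown.

```sh
# Hypothetical helper: count the characters on standard input that
# belong to the character class named by $1, in the POSIX Locale.
count_class() {
    # tr -cd deletes every character that is NOT in the class; wc -c
    # counts what remains. $(( )) removes any padding that wc adds.
    echo $(( $(LC_ALL=POSIX tr -cd "[:$1:]" | wc -c) ))
}

# The input holds one each of <space>, <tab>, <newline>,
# <carriage-return>, <form-feed>, and <vertical-tab> between letters.
printf 'a b\tc\nd\re\ff\vg' | count_class space    # prints 6
printf 'a b\tc\nd\re\ff\vg' | count_class blank    # prints 2
```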
2.2.2.159 working directory [current working directory]: A directory, associated with a process, that is used in pathname resolution for pathnames that do not begin with a slash.

2.2.2.160 write: To output characters to a file, such as standard output or standard error. Unless otherwise stated, standard output is the default output destination for all uses of the term write.

BEGIN_RATIONALE

2.2.2.161 General Terms Rationale. (This subclause is not a part of P1003.2)

Many of the terms originated in POSIX.1 {8} and are duplicated in this standard to meet editorial requirements. In some cases, there is supplementary text that presents additional information concerning POSIX.2 aspects of the concept.

This standard uses the term character to mean a sequence of one or more bytes representing a single graphic symbol, as defined in POSIX.1 {8}.

The deviation in the exact text of the C Standard {7} definition for byte meets the intent of the C Standard {7} Rationale and the developers of POSIX.1 {8}, but clears up the ambiguity raised by the term basic execution character set, which is not defined in POSIX.1 {8}. It is expected that a future version of POSIX.1 {8} will align with the text used here. The octet-minimum requirement is merely a reflection of the {CHAR_BIT} value in POSIX.1 {8} and the C Standard {7}.

The POSIX.1 {8} term file mode is a superset of the POSIX.2 file mode bits. POSIX.1 {8} defines the file mode as the entire mode_t object (which includes the file type, historically in the upper four bits, the sticky bit on most implementations, and potentially other nonstandardized attributes), while POSIX.2 file mode bits include only the eleven defined bits.
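The distinction between the file mode and the file mode bits can be sketched with shell arithmetic. The example assumes the historical octal layout (0100000 for a regular file type, 01000 for the sticky bit) and the octal values 04000, 02000, and 0777 for the set-user-ID, set-group-ID, and permission bits; the function name file_mode_bits is hypothetical.

```sh
# Hypothetical helper: reduce a historical octal file mode value to the
# eleven file mode bits (set-user-ID, set-group-ID, and the nine
# read/write/execute permission bits).
file_mode_bits() {
    # 06777 = 04000 (set-user-ID) | 02000 (set-group-ID) | 0777.
    # The file type bits and the sticky bit (01000) fall outside the mask.
    printf '%o\n' "$(( $1 & 06777 ))"
}

file_mode_bits 0100644    # prints 644
file_mode_bits 0101755    # prints 755 (sticky bit dropped)
file_mode_bits 0104755    # prints 4755 (set-user-ID kept)
```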
The terms command and utility are related but have distinct meanings. Command is defined as ``a directive to a shell to perform a specific task.'' The directive can be in the form of a single utility name (for example, ls), or the directive can take the form of a compound command (for example, ls | grep name | pr). A utility is a program that is callable by name from a shell. Issuing only the utility's name to a shell is the equivalent of a one-word command. A utility may be invoked as a separate program that executes in a different process from the command language interpreter, or may be implemented as a part of the command language interpreter. For example, the echo command (the directive to perform a specific task) may be implemented such that the echo utility (the logic that performs the task of echoing) is in a separate program and is therefore executed in a process different from that of the command language interpreter. Conversely, the logic that performs the echo utility could be built into the command language interpreter and therefore execute in the same process as the command language interpreter.

The terms tool and application can be thought of as being synonymous with utility from the perspective of the operating system kernel. Tools, applications, and utilities have historically run in processes above the kernel level. Tools and utilities have historically been a part of the operating system nonkernel code and have performed system-related functions such as listing directory contents, checking file systems, repairing file systems, or extracting system status information. Applications have not generally been a part of the operating system, and perform nonsystem-related functions such as word processing, architectural design, mechanical design, workstation publishing, or financial analysis.
Utilities have most frequently been provided by the operating system vendor, applications by third party software vendors or by the users themselves. Nevertheless, the standard does not differentiate between tools, utilities, and applications when it comes to receiving services from the system, a shell, or the standard utilities. (For example, the xargs utility invokes another utility; it would be of fairly limited usefulness if the users couldn't run their own applications in place of the standard utilities.) Utilities are not applications in the sense that they are not themselves subject to the restrictions of this standard or any other standard--there is no requirement for grep, stty, or any of the utilities defined here to be any of the classes of Conforming POSIX.2 Applications.

The term text file does not prevent the inclusion of control or other nonprintable characters (other than NUL). Therefore, standard utilities that list text files as inputs or outputs are either able to process the special characters gracefully or they explicitly describe their limitations within their individual subclauses.

The definition of text file has caused a good deal of controversy. The only difference between text and binary here is that text files have lines of (less than {LINE_MAX}) bytes, with no NUL characters, each terminated by a <newline> character. The definition allows a file with a single <newline>, but not a totally empty file, to be called a text file. If a file ends with an incomplete line it is not strictly a text file by this definition.
A related point is that the <newline> character referred to in this standard is not some generic line separator, but a single character; files created on systems that use multiple characters for ends of lines are not portable to all POSIX systems without some translation process unspecified by this standard.

The term hard link is historically derived. In systems without extensions to ln, it is a synonym for link. The concept of a symbolic link originated with BSD systems and the term hard is used to differentiate between the two types of links.

There are some terms used that are undefined in POSIX.2, POSIX.1 {8}, or the C Standard {7}. The working group believes that these terms have a ``common usage,'' and that a definition in POSIX.2 would not be appropriate. Terms in this category include, but are not limited to, the following: application, character set, login session, user. Good sources for general terms of this type are the ISO/AFNOR Dictionary of Computer Science {B12} and IEEE Dictionary {B18}.

The term file name was defined in previous drafts to be a synonym for pathname. It was removed in the face of objections that it was too close to filename, which means something different (a pathname component). The general solution to this has been to use the term file in parameter names, rather than file_name, and to make more liberal use of the correct term, pathname; an alternate solution has been to replace file name with the name of the file.

Many character names are included in this subclause. Because of historical usage, some of these names are a bit different from the ones used in international standards for character sets, such as ISO/IEC 646 {1}.
It was felt that many more UNIX system people than character set lawyers would be reading and reviewing the standard, so the former group was the one accommodated. On the other hand, the precise definitions of <blank>, <space>, and white space have replaced common usage (where they have been used virtually interchangeably), as the standard attempts to balance readability against precision.

In earlier drafts, the names for the character pairs ( ), [ ], and { } were referred to as ``opening'' and ``closing'' parentheses, brackets, and braces. These were changed to the current ``left'' and ``right.'' When the characters are used to express natural language, the terms ``open'' and ``close'' imply text direction more strongly than ``left'' and ``right.'' By POSIX.2 definition, the <left-parenthesis> character will always be mapped to the glyph '(' regardless of the locale. But when reading right-to-left, the opening punctuation of a parenthesized text segment would be ')'. The <left-parenthesis> and <right-parenthesis> forms are the correct ones because the punctuation appears on the left and right, respectively, of the parenthesized text regardless of the direction one might be reading the text.

The <backspace> character and the ERASE special character defined in POSIX.1 {8} should not be confused. The use of the <backspace> character and the ERASE special character defined in the POSIX.1 {8} termios clause on special characters (7.1.1.9) are distinct even though the ERASE special character may be set to <backspace>.

In most one-byte character sets, such as ASCII, the concept of column positions is identical to character positions and to bytes. Therefore, it has been historically acceptable for some implementations to describe line folding or tab stops or table column alignment in terms of bytes or character positions.
Other character sets pose complications, as they can have internal representations longer than one octet and they can have displayable characters that have different widths on the terminal screen or printer.

In this standard the term column positions has been defined to mean character--not byte--positions in input files (such as ``column position 7 of the FORTRAN input''). Output files describe the column position in terms of the display width of the narrowest printable character in the character set, adjusted to fit the characteristics of the output device. It is very possible that n column positions will not be able to hold n characters in some character sets, unless all of those characters are of the narrowest width. It is assumed that the implementation is aware of the width of the various characters, deriving this information from the value of LC_CTYPE, and thus can determine how many column positions to allot for each character in those utilities where it is important. This information is not available to the portable application writer because POSIX.2 provides no interface specification to retrieve such information.

The term column position was used instead of the more natural column as the latter is frequently used in the standard in the different contexts of columns of figures, columns of table values, etc. Wherever confusion might result, these latter types of columns are referred to as text columns.

The definition of binary file was removed, as the term is not used in the standard.

The ISO/IEC 646 {1} character set standard permits substitution of national currency symbols for the character $ in the ``reference character set'' (which is the same as ASCII).
This standard permits the substitution only of the actual characters shown in ISO/IEC 646 {1}: currency sign for the dollar sign and pound sign for the number sign. This document uses the latter names and their symbols, but it is valid for an implementation to accept, for instance, the pound sign (£) as a comment character in the shell, if that is what the locale's character set uses instead of the number sign (#). Other variations of national currency symbols are not allowed, per the request of the WG15 POSIX working group.

The term stream is not related to System V's STREAMS communications facility; it is derived from historical UNIX system usage and has been made official by the C Standard {7}. The POSIX.2 standard makes no differentiation between C's text stream and binary stream.

The formula used in the POSIX.1 {8} definition of seconds since the Epoch is not perfect in all cases. See the related rationale in POSIX.1 {8}.

END_RATIONALE

2.2.3 Abbreviations

For the purposes of this standard, the following abbreviations apply:

2.2.3.1 C Standard: ISO/IEC 9899: ..., Information processing systems--Programming languages--C {7}.

2.2.3.2 ERE: An Extended Regular Expression, as defined in 2.8.4.

2.2.3.3 LC_*: An abbreviation used to represent all of the environment variables named in 2.6 whose names begin with the characters ``LC_''.

2.2.3.4 POSIX.1: ISO/IEC 9945-1: 1990: Information technology--Portable Operating System Interface (POSIX)--Part 1: System Application Program Interface (API) [C Language] {8}.

2.2.3.5 POSIX.2: This standard.
2.2.3.6 RE [BRE]: A Basic Regular Expression, as defined in 2.8.3.

2.3 Built-in Utilities

Any of the standard utilities may be implemented as regular built-in utilities within the command language interpreter. This is usually done to increase the performance of frequently-used utilities or to achieve functionality that would be more difficult in a separate environment. The utilities named in Table 2-2 are frequently provided in built-in form. All of the utilities named in the table have special properties in terms of command search order within the shell, as described in 3.9.1.1.

Table 2-2 - Regular Built-in Utilities

    cd         false      kill       true       wait
    command    getopts    read       umask

However, all of the standard utilities, including the regular built-ins in the table, but not the special built-ins described in 3.14, shall be implemented in a manner so that they can be accessed via the POSIX.1 {8} exec family of functions (if the underlying operating system provides the services of such a family to application programs) and can be invoked directly by those standard utilities that require it (env, find, nohup, xargs).
Since versions shall be provided for all utilities except for those listed previously, an application running on a system that conforms to both POSIX.1 {8} and Section 7 of this standard can use the exec family of functions, in addition to the shell command interface in 7.1 [such as the system() and popen() functions in the C binding] defined by this standard, to execute any of these utilities.

BEGIN_RATIONALE

2.3.1 Built-in Utilities Rationale. (This subclause is not a part of P1003.2)

In earlier drafts, the table of built-ins implied two things to a conforming application: these may be built-ins and these need not be executable. The second implication has now been removed and all utilities can be exec-ed. There is no requirement that these be actually built into the shell itself, but many shells will want to do so because 3.9.1.1 requires that they be found prior to the PATH search. The shell could satisfy its requirements by keeping a list of the names and directly accessing the file-system versions regardless of PATH. Providing all of the required functionality for those such as cd or read would be more difficult.

There were originally three justifications for allowing the omission of exec-able versions:

(1) This would require wasting space in the file system, at the expense of very small systems. However, it has been pointed out that all nine in the table can be provided with nine links to a single-line shell script:

        $0 "$@"

(2) There is no sense in requiring invocation of utilities like cd because they have no value outside the shell environment or cannot be useful in a child process. However, counter-examples always seemed to be available for even the strangest cases:

        find . -type d -exec cd {} \; -exec foo {} \;
            (which invokes foo on accessible directories)
        ps ... | sed ... | xargs kill
        find . -exec true \; -a ...
            (where true is used for temporary debugging)

(3) It is confusing to have something such as kill that can easily be in the file system in the base standard, but requires built-in status for the UPE (for the % job control job ID notation). It was decided that it was more appropriate to describe the required functionality (rather than the implementation) to the system implementors and let them decide how to satisfy it.

On the other hand, there were objections raised during balloting that any distinction like this between utilities was not useful to applications and that the cost to correct it was small. These arguments were ultimately the most effective.

There were varying reasons for including utilities in the table of built-ins:

cd, getopts, read, umask, wait
    The functionality of these utilities is performed more simply within the context of the current process. An example can be taken from the usage of the cd utility. The purpose of the utility is to change the working directory for subsequent operations. The actions of cd affect the process in which cd is executed and all subsequent child processes of that process. Based on the POSIX.1 {8} process model, changes in the process environment of a child process have no effect on the parent process. If the cd utility were executed from a child process, the working directory change would be effective only in the child process. Child processes initiated subsequent to the child process that executed the cd utility would not have a changed working directory relative to the parent process.
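The effect described for cd can be observed with a short sketch. The function name cd_in_child is hypothetical; the example starts a separate sh process, changes directory there, and compares the parent's working directory before and after.

```sh
# Hypothetical demonstration: a working directory change made in a
# child process is not visible to the parent.
cd_in_child() {
    before=$(pwd)
    # sh -c starts a separate process; its cd affects only that process.
    in_child=$(sh -c 'cd / && pwd')
    after=$(pwd)
    printf 'parent before: %s\nchild: %s\nparent after: %s\n' \
        "$before" "$in_child" "$after"
}
```

After a call to cd_in_child, the values of before and after are the same, while in_child holds the root directory.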
command
    This utility was placed in the table primarily to protect scripts that are concerned about their PATH being manipulated. The ``secure'' shell script example in 4.12.10 would not be possible if a PATH change retrieved an alien version of command. (An alternative would have been to implement getconf as a built-in, but it was felt that it carried too many changing configuration strings to require in the shell.)

kill
    Since common extensions to kill (including the planned User Portability Extension) provide optional job control functionality using shell notation (%1, %2, etc.), some implementations would find it extremely difficult to provide this outside the shell.

true, false
    These are in the table as a courtesy to programmers who wish to use the ``while true'' shell construct without protecting true from PATH searches. (It is acknowledged that ``while :'' also works, but the idiom with true is historically pervasive.)

All utilities, including those in the table, are accessible via the functions in 7.1.1 or 7.1.2 [such as system() or popen()]. There are situations where the return functionality of system() and popen() is not desirable. Applications that require the exit status of the invoked utility will not be able to use system() or popen(), since the exit status returned is that of the command language interpreter rather than that of the invoked utility. The alternative for such applications is the use of the exec family. (The text concerning conformance to POSIX.1 {8} was included because where exec is not provided in the underlying system, there is no way to require that utilities be exec-able.)

END_RATIONALE

2.4 Character Set

Conforming implementations shall support one or more coded character sets.
Each supported coded character set shall include the portable character set specified in Table 2-3. The table defines the characters in the portable character set and the corresponding symbolic character names used to identify each character in a character set description file. The names are chosen to correspond closely with character names defined in other international standards. The table contains more than one symbolic character name for characters whose traditional name differs from the chosen name.

This standard places only the following requirements on the encoded values of the characters in the portable character set:

(1) If the encoded values associated with each member of the portable character set are not invariant across all locales supported by the implementation, the results achieved by an application accessing those locales are unspecified.

(2) The encoded values associated with the digits '0' to '9' shall be such that the value of each character after '0' shall be one greater than the value of the previous character.

(3) A null character, NUL, which has all bits set to zero, shall be in the set of characters.

Conforming implementations shall support certain character and character set attributes, as defined in 2.5.1.

2.4.1 Character Set Description File

Implementations shall provide a character set description file for at least one coded character set supported by the implementation. These files are referred to elsewhere in this standard as charmap files. It is implementation defined whether or not users or applications can provide additional character set description files. If such a capability is supported, the system documentation shall describe the rules for the creation of such files.

Each character set description file shall define characteristics for the coded character set and the encoding for the characters specified in Table 2-3, and may define encoding for additional characters supported by the implementation.
Other information about the coded character set may also be in the file. Coded character set character values shall be defined using symbolic character names followed by character encoding values.

Table 2-3 - Character Set and Symbolic Names

[Table 2-3 body (Symbolic Name and Glyph columns) is not recoverable from this copy.]

[LC_CTYPE category definition listing is not recoverable from this copy; the surviving labels are lower, digit, space, cntrl, punct, xdigit, blank, toupper, and tolower, and the listing ends with END LC_CTYPE.]

The LC_CTYPE category shall define character classification, case conversion, and other character attributes. In addition, a series of characters can be represented by three adjacent periods representing an ellipsis symbol (``...''). The ellipsis specification shall be interpreted as meaning that all values between the values preceding and following it represent valid characters. The ellipsis specification shall be valid only within a single encoded character set. An ellipsis shall be interpreted as including in the list all characters with an encoded value higher than the encoded value of the character preceding the ellipsis and lower than the encoded value of the character following the ellipsis.

Example: \x30;...;\x39; includes in the character class all characters with encoded values between the endpoints.

The following keywords shall be recognized. In the descriptions, the term ``automatically included'' means that it shall not be an error to either include the referenced characters or to omit them; the implementation shall provide them if missing and accept them silently if present.

copy      Specify the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified.

upper     Define characters to be classified as uppercase letters. No character specified for the keywords cntrl, digit, punct, or space shall be specified.
If this keyword is not specified, the uppercase letters A through Z, as defined in Table 2-3 (see 2.4.1), shall automatically belong to this class, with implementation-defined character values.

lower     Define characters to be classified as lowercase letters. No character specified for the keywords cntrl, digit, punct, or space shall be specified. If this keyword is not specified, the lowercase letters a through z, as defined in Table 2-3 (see 2.4.1), shall automatically belong to this class, with implementation-defined character values.

alpha     Define characters to be classified as letters. No character specified for the keywords cntrl, digit, punct, or space shall be specified. In addition, characters classified as either upper or lower shall automatically belong to this class.

digit     Define the characters to be classified as numeric digits. Only the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 shall be specified, and in ascending sequence by numerical value. If this keyword is not specified, the digits 0 through 9, as defined in Table 2-3 (see 2.4.1), shall automatically belong to this class, with implementation-defined character values.

space     Define characters to be classified as white-space characters. No character specified for the keywords upper, lower, alpha, digit, graph, or xdigit shall be specified. If this keyword is not specified, the characters <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab>, as defined in Table 2-3 (see 2.4.1), shall automatically belong to this class, with implementation-defined character values. Any characters included in the class blank shall be automatically included.

cntrl     Define characters to be classified as control characters. No character specified for the keywords upper, lower, alpha, digit, punct, graph, print, or xdigit shall be
specified.

punct     Define characters to be classified as punctuation characters. No character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> character shall be specified.

graph     Define characters to be classified as printable characters, not including the <space> character. If this keyword is not specified, characters specified for the keywords upper, lower, alpha, digit, xdigit, and punct shall belong to this character class. No character specified for the keyword cntrl shall be specified.

print     Define characters to be classified as printable characters, including the <space> character. If this keyword is not provided, characters specified for the keywords upper, lower, alpha, digit, xdigit, punct, and the <space> character shall belong to this character class. No character specified for the keyword cntrl shall be specified.

xdigit    Define the characters to be classified as hexadecimal digits. Only the characters defined for the class digit shall be specified, in ascending sequence by numerical value, followed by one or more sets of six characters representing the hexadecimal digits 10 through 15, with each set in ascending order (for example A, B, C, D, E, F, a, b, c, d, e, f). If this keyword is not specified, the digits 0 through 9, the uppercase letters A through F, and the lowercase letters a through f, as defined in Table 2-3 (see 2.4.1), shall automatically belong to this class, with implementation-defined character values.

blank     Define characters to be classified as <blank> characters. If this keyword is unspecified, the characters <space> and <tab> shall belong to this character class.

toupper   Define the mapping of lowercase letters to uppercase letters. The operand shall consist of character pairs, separated by semicolons. The characters in each character pair shall be separated by a comma and the pair enclosed by parentheses.
The first character in each pair shall be the lowercase letter, the second the corresponding uppercase letter. Only characters specified for the keywords lower and upper shall be specified. If this keyword is not specified, the lowercase letters a through z, and their corresponding uppercase letters A through Z, as defined in Table 2-3 (see 2.4.1), shall automatically be included, with implementation-defined character values.

tolower    Define the mapping of uppercase letters to lowercase letters. The operand shall consist of character pairs, separated by semicolons. The characters in each character pair are separated by a comma and the pair enclosed by parentheses. The first character in each pair shall be the uppercase letter, the second the corresponding lowercase letter. Only characters specified for the keywords lower and upper shall be specified. The tolower keyword is optional. If specified, the uppercase letters A through Z, as defined in Table 2-3, and their corresponding lowercase letters, shall be specified. If this keyword is not specified, the mapping shall be the reverse mapping of the one specified for toupper.

Table 2-6 shows the allowed character class combinations.
              Table 2-6 - Valid Character Class Combinations

 In                             Can Also Belong To
 Class    upper lower alpha digit space cntrl punct graph print xdigit blank
 ------   ----- ----- ----- ----- ----- ----- ----- ----- ----- ------ -----
 upper      -     -     M     X     X     X     X     D     D     -      X
 lower      -     -     M     X     X     X     X     D     D     -      X
 alpha      -     -     -     X     X     X     X     D     D     -      X
 digit      X     X     X     -     X     X     X     D     D     -      X
 space      X     X     X     X     -     -     *     *     *     X      -
 cntrl      X     X     X     X     -     -     X     X     X     X      -
 punct      X     X     X     X     -     X     -     D     D     X      -
 graph      -     -     -     -     -     X     -     -     -     -      -
 print      -     -     -     -     -     X     -     -     -     -      -
 xdigit     -     -     -     -     X     X     X     D     D     -      X
 blank      X     X     X     X     M     -     *     *     *     X      -

NOTES:
(1) Explanation of codes:
      M   Always
      D   Default; belongs to class if not specified
      -   Permitted
      X   Mutually exclusive
      *   See note (2)
(2) The <space> character, which is part of the space and blank classes, cannot belong to punct or graph, but automatically shall belong to the print class. Other space or blank characters can be classified as punct, graph, and/or print.

BEGIN_RATIONALE

2.5.2.1.1 LC_CTYPE Rationale. (This subclause is not a part of P1003.2)

The LC_CTYPE category primarily is used to define the encoding-independent aspects of a character set, such as character classification. In addition, certain encoding-dependent characteristics are also defined for an application via the LC_CTYPE category. POSIX.2 does not mandate that the encoding used in the locale is the same as the one used by the application, because an implementation may decide that it is advantageous to define locales in a system-wide encoding rather than having multiple, logically identical locales in different encodings, and to convert from the application encoding to the system-wide encoding on usage. Other implementations could require encoding-dependent locales.

In either case, the LC_CTYPE attributes that are directly dependent on the encoding, such as mb_cur_max and the display width of characters, are not user-specifiable in a locale source, and are consequently not defined as keywords.

As the LC_CTYPE character classes are based on the C Standard {7} character-class definition, the category does not support multicharacter elements. For instance, the German character ß (sharp s) is traditionally classified as a lowercase letter. There is no corresponding uppercase letter; in proper capitalization of German text the ß will be replaced by SS; i.e., by two characters. This kind of conversion is outside the scope of the toupper and tolower keywords.

Where POSIX.2 specifies that only certain characters can be specified, as for the keywords digit and xdigit, the specified characters must be from the portable character set, as shown. As an example, only the Arabic digits 0 through 9 are acceptable as digits.

The character classes digit, xdigit, lower, upper, and space have a set of automatically included characters. These only need to be specified if the character values (i.e., encoding) differ from the implementation default values.

The definition of character class digit requires that only ten characters--the ones defining digits--can be specified; alternate digits (e.g., Hindi or Kanji) cannot be specified here.
However, the encoding may vary if an implementation supports more than one encoding.

The definition of character class xdigit requires that the characters included in character class digit are included here also, and allows for different symbols for the hexadecimal digits 10 through 15.

END_RATIONALE

2.5.2.2 LC_COLLATE

A collation sequence definition shall define the relative order between collating elements (characters and multicharacter collating elements) in the locale. This order is expressed in terms of collation values; i.e., by assigning each element one or more collation values (also known as collation weights). This does not imply that implementations shall assign such values, but that ordering of strings using the resultant collation definition in the locale shall behave as if such assignment is done and used in the collation process. The collation sequence definition shall be used by regular expressions, pattern matching, and sorting. The following capabilities are provided:

(1) Multicharacter collating elements. Specification of multicharacter collating elements (i.e., sequences of two or more characters to be collated as an entity).

(2) User-defined ordering of collating elements. Each collating element shall be assigned a collation value defining its order in the character (or basic) collation sequence. This ordering is used by regular expressions and pattern matching and, unless collation weights are explicitly specified, also as the collation weight to be used in sorting.

(3) Multiple weights and equivalence classes. Collating elements can be assigned one or more (up to the limit {COLL_WEIGHTS_MAX}) collating weights for use in sorting. The first weight is hereafter referred to as the primary weight.

(4) One-to-Many mapping.
A single character is mapped into a string of collating elements.

(5) Many-to-Many substitution. A string of one or more characters is substituted by another string (or an empty string, i.e., the character or characters shall be ignored for collation purposes).

(6) Equivalence class definition. Two or more collating elements have the same collation value (primary weight).

(7) Ordering by weights. When two strings are compared to determine their relative order, the two strings are first broken up into a series of collating elements, and each successive pair of elements is compared according to the relative primary weights for the elements. If equal, and more than one weight has been assigned, then the pairs of collating elements are recompared according to the relative subsequent weights, until either a pair of collating elements compares unequal or the weights are exhausted.

The following keywords shall be recognized in a collation sequence definition. They are described in detail in the following subclauses.

copy                 Specify the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified.

collating-element    Define a collating-element symbol representing a multicharacter collating element. This keyword is optional.

collating-symbol     Define a collating symbol for use in collation order statements. This keyword is optional.

order_start          Define collation rules. This statement is followed by one or more collation order statements, assigning character collation values and collation weights to collating elements.

order_end            Specify the end of the collation-order statements.
1 Table 2-7 - LC_COLLATE Category Definition in the POSIX Locale __________________________________________________________________________________________________________________________________________________ LC_COLLATE # This is the POSIX Locale definition for the LC_COLLATE category. # The order is the same as in the ASCII code set. order_start forward Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 84 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 _________________________________________________________________________ Table 2-7 - LC_COLLATE Category Definition in the POSIX Locale (_c_o_n_t_i_n_u_e_d) _________________________________________________________________________ Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 85 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX

_________________________________________________________________________ 2.5.2.2.1 collating-element Keyword In addition to the collating elements in the character set, the collating-element keyword shall be used to define multicharacter collating elements. The syntax is "collating-element %s from %s\n", <_c_o_l_l_a_t_i_n_g-_s_y_m_b_o_l>, <_s_t_r_i_n_g> The <_c_o_l_l_a_t_i_n_g-_s_y_m_b_o_l> operand shall be a symbolic name, enclosed between 1 angle brackets (< and >), and shall not duplicate any symbolic name in the current charmap file (if any), or any other symbolic name defined in this collation definition. The string operand shall be a string of two or more characters that shall collate as an entity. A <_c_o_l_l_a_t_i_n_g- 1 _e_l_e_m_e_n_t> defined via this keyword is only recognized with the LC_COLLATE 1 category. _E_x_a_m_p_l_e: collating-element from collating-element from collating-element from ll Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 86 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 Table 2-7 - LC_COLLATE Category Definition in the POSIX Locale (_c_o_n_c_l_u_d_e_d) _________________________________________________________________________

order_end # END LC_COLLATE __________________________________________________________________________________________________________________________________________________ _2._5._2._2._2 collating-symbol _K_e_y_w_o_r_d This keyword shall be used to define symbols for use in collation sequence statements; i.e., between the order_start and the order_end keywords. The syntax is Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 87 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX "collating-symbol %s\n", <_c_o_l_l_a_t_i_n_g-_s_y_m_b_o_l> The <_c_o_l_l_a_t_i_n_g-_s_y_m_b_o_l> shall be a symbolic name, enclosed between angle 1 brackets (< and >), and shall not duplicate any symbolic name in the current charmap file (if any), or any other symbolic name defined in this collation definition. A <_c_o_l_l_a_t_i_n_g-_s_y_m_b_o_l> defined via this keyword is only recognized with the LC_COLLATE category. _E_x_a_m_p_l_e: collating-symbol collating-symbol 2 _2._5._2._2._3 order_start _K_e_y_w_o_r_d The order_start keyword shall precede collation order entries and also defines the number of weights for this collation sequence definition and other collation rules. The syntax of the order_start keyword is: "order_start %s;%s;...;%s\n", <_s_o_r_t-_r_u_l_e_s>, <_s_o_r_t-_r_u_l_e_s> ... The operands to the order_start keyword are optional. If present, the operands define rules to be applied when strings are compared. The number of operands define how many weights each element is assigned; if no operands are present, one forward operand is assumed. If present, the first operand defines rules to be applied when comparing strings using the first (primary) weight; the second when comparing strings using the second weight, and so on. Operands shall be separated by semicolons (;). Each operand shall consist of one or more collation directives, separated by commas (,). 
If the number of operands exceeds the {COLL_WEIGHTS_MAX} limit, the utility shall issue a warning message. The following directives shall be supported:

forward     Specifies that comparison operations for the weight level shall proceed from start of string towards the end of string.

backward    Specifies that comparison operations for the weight level shall proceed from end of string towards the beginning of string.

position    Specifies that comparison operations for the weight level will consider the relative position of non-IGNOREd elements in the strings. The string containing a non-IGNOREd element after the fewest IGNOREd collating elements from the start of the compare shall collate first. If both strings contain a non-IGNOREd character in the same relative position, the collating values assigned to the elements shall determine the ordering. In case of equality, subsequent non-IGNOREd characters shall be considered in the same manner.

The directives forward and backward are mutually exclusive.

Example:

    order_start forward;backward

If no operands are specified, a single forward operand shall be assumed.

2.5.2.2.4 Collation Order

The order_start keyword shall be followed by collating element entries. The syntax for the collating element entries is

    "%s %s;%s;...;%s\n", <collating-element>, <weight>, <weight>, ...

Each collating-element shall consist of either a character (in any of the forms defined in 2.5.2), a <collating-element>, a <collating-symbol>, an ellipsis, or the special symbol UNDEFINED.
The order in which collating elements are specified determines the character collation sequence, such that each collating element shall compare less than the elements following it. The NUL character shall compare lower than any other character.

A <collating-element> shall be used to specify multicharacter collating elements, and indicates that the character sequence specified via the <collating-element> is to be collated as a unit and in the relative order specified by its place.

A <collating-symbol> shall be used to define a position in the relative order for use in weights.

The ellipsis symbol (``...'') specifies that a sequence of characters shall collate according to their encoded character values. It shall be interpreted as indicating that all characters with a coded character set value higher than the value of the character in the preceding line, and lower than the coded character set value for the character in the following line, in the current coded character set, shall be placed in the character collation order between the previous and the following character in ascending order according to their coded character set values. An initial ellipsis shall be interpreted as if the preceding line specified the NUL character, and a trailing ellipsis as if the following line specified the highest coded character set value in the current coded character set. An ellipsis shall be treated as invalid if the preceding or following lines do not specify characters in the current coded character set. The use of the ellipsis symbol ties the definition to a specific coded character set and may preclude the definition from being portable between implementations.
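As a non-normative illustration, the ellipsis rule above can be sketched in Python. The function name expand_ellipsis is hypothetical, and Python code points stand in for the "coded character set values" of the current coded character set; initial and trailing ellipses are not handled. Rejecting a pair whose first character does not precede the second is an assumption of this sketch, not a requirement stated above.

```python
def expand_ellipsis(prev_char, next_char):
    """Return the characters an ellipsis line stands for.

    prev_char is the character on the line before the ellipsis and
    next_char the character on the line after it. The result lists every
    character whose code value lies strictly between the two, in ascending
    order of code value.
    """
    lo, hi = ord(prev_char), ord(next_char)
    if lo >= hi:
        # Assumption of this sketch: such an ellipsis is rejected.
        raise ValueError("preceding character must have the lower code value")
    return [chr(code) for code in range(lo + 1, hi)]


# The entries "a", "...", "e" yield this character collation order:
order = ["a"] + expand_ellipsis("a", "e") + ["e"]
print(order)
```

Because the expansion depends on code values, the same three-line definition can produce a different order under a different coded character set, which is the portability concern noted above.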
The symbol UNDEFINED shall be interpreted as including all coded character set values not specified explicitly or via the ellipsis symbol. Such characters shall be inserted in the character collation order at the point indicated by the symbol, and in ascending order according to their coded character set values. If no UNDEFINED symbol is specified, and the current coded character set contains characters not specified in this clause, the utility shall issue a warning message and place such characters at the end of the character collation order.

The optional operands for each collation-element shall be used to define the primary, secondary, or subsequent weights for the collating element. The first operand specifies the relative primary weight, the second the relative secondary weight, and so on. Two or more collation-elements can be assigned the same weight; they belong to the same equivalence class if they have the same primary weight. Collation shall behave as if, for each weight level, IGNOREd elements are removed. Then each successive pair of elements shall be compared according to the relative weights for the elements. If the two strings compare equal, the process shall be repeated for the next weight level, up to the limit {COLL_WEIGHTS_MAX}.

Weights shall be expressed as characters (in any of the forms specified in 2.5.2), <collating-symbol>s, <collating-element>s, an ellipsis, or the special symbol IGNORE. A single character, a <collating-symbol>, or a <collating-element> shall represent the relative order in the character collating sequence of the character or symbol, rather than the character or characters themselves.

One-to-many mapping is indicated by specifying two or more concatenated characters or symbolic names.
Thus, if the character ``'' is 1 given the string as a weight, comparisons shall be performed as if 1 all occurrences of the character are replaced by . If it 1 is desirable to define and as an equivalence class, then a 1 collating-element must be defined for the string ``ss'', as in the 1 example below. 1 All characters specified via an ellipsis shall by default be assigned 1 unique weights, equal to the relative order of characters. Characters 1 specified via an explicit or implicit UNDEFINED special symbol shall by 1 default be assigned the same primary weight (i.e., belong to the same 1 equivalence class). An ellipsis symbol as a weight shall be interpreted 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 90 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 to mean that each character in the sequence shall have unique weights, 1 equal to the relative order of their character in the character collation 1 sequence. Secondary and subsequent weights have unique values. The use 1 of the ellipsis as a weight shall be treated as an error if the collating 1 element is neither an ellipsis nor the special symbol UNDEFINED. 1 The special keyword IGNORE as a weight shall indicate that when strings are compared using the weights at the level where IGNORE is specified, the collating element shall be ignored; i.e., as if the string did not contain the collating element. In regular expressions and pattern matching, all characters that are IGNOREd in their primary weight form an equivalence class. An empty operand shall be interpreted as the collating-element itself. For example, the order statement ; is equal to An ellipsis can be used as an operand if the collating-element was an ellipsis, and shall be interpreted as the value of each character defined by the ellipsis. 
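As a non-normative illustration, the level-by-level comparison described in this subclause (removal of IGNOREd elements, then comparison of weights, repeated for each weight level) can be sketched in Python. All names here (IGNORE, WEIGHTS, level_key, collate_compare) are hypothetical. The sketch assumes every collating element is a single character, uses small integers as relative weights, and covers only the forward and backward directives.

```python
IGNORE = None  # stand-in for the special symbol IGNORE

# Hypothetical two-level weight table: (primary, secondary) per element.
# Upper- and lowercase letters share a primary weight; the hyphen is
# ignored at both levels.
WEIGHTS = {
    "a": (1, 1), "A": (1, 2),
    "b": (2, 1), "B": (2, 2),
    "-": (IGNORE, IGNORE),
}


def level_key(string, weights, level, direction):
    """Weights of the non-IGNOREd elements of string at one level."""
    key = [weights[ch][level] for ch in string
           if weights[ch][level] is not IGNORE]
    if direction == "backward":
        key.reverse()  # compare from end of string towards the beginning
    return key


def collate_compare(s1, s2, weights, directions):
    """Return -1, 0, or 1; directions holds one directive per level."""
    for level, direction in enumerate(directions):
        k1 = level_key(s1, weights, level, direction)
        k2 = level_key(s2, weights, level, direction)
        if k1 != k2:
            return -1 if k1 < k2 else 1
    return 0  # equal at every weight level


print(collate_compare("ab", "aB", WEIGHTS, ["forward", "forward"]))
print(collate_compare("a-b", "ab", WEIGHTS, ["forward", "forward"]))
print(collate_compare("aB", "Ab", WEIGHTS, ["forward", "backward"]))
```

The last two calls differ from a plain code-value comparison: the hyphen drops out entirely, and with a backward second level the strings "aB" and "Ab" are ordered by their final characters first.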
The collation order as defined in this clause defines the interpretation 1 of bracket expressions in regular expressions (see 2.8.3.2). 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 91 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX _E_x_a_m_p_l_e: order_start forward;backward UNDEFINED IGNORE;IGNORE ; ... ;... ; ; ; ; ; ; ; ; ; 2 ; ... ;... order_end This example is interpreted as follows: (1) The UNDEFINED means that all characters not specified in this definition (explicitly or via the ellipsis) shall be ignored for collation purposes; for regular expression purposes they are ordered first. (2) All characters between and shall have the same primary equivalence class and individual secondary weights based on their ordinal encoded values. (3) All characters based on the upper- or lowercase character a belong to the same primary equivalence class. (4) The multicharacter collating element is represented by the collating symbol and belongs to the same primary equivalence class as the multicharacter collating element . (5) Note that it is not possible to use the collating element 1 as a weight and expect it to be expanded to the string ``ss''. 1 When used as a weight, any collating-element represents the 1 relative order assigned to it in the character collation 1 sequence, not the string from which it was derived (compare with 1 ). 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 92 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 2.5.2.2.5 order_end Keyword The collating order entries shall be terminated with an order_end keyword. BEGIN_RATIONALE 2.5.2.2.6 LC_COLLATE Rationale. 
(This subclause is not a part of P1003.2)

The LC_COLLATE category governs the collation order in the locale, and thus the processing of the C Standard {7} strxfrm() and strcoll() functions, as well as a number of POSIX.2 utilities.

The rules governing collation depend to some extent on the use. At least five different levels of increasingly complex collation rules can be distinguished:

(1) Byte/machine code order. This is the historical collation order in the UNIX system and many proprietary operating systems. Collation is here done character by character, without any regard to context. The primary virtue is that it usually is quite fast, and also completely deterministic; it works well when the native machine collation sequence matches the user expectations.

(2) Character order. On this level, collation is also done character by character, without regard to context. The order between characters is, however, not determined by the code values, but by the user's expectations of the ``correct'' order between characters. In addition, such a (simple) collation order can specify that certain characters collate equal (e.g., upper- and lowercase letters).

(3) String ordering. On this level, entire strings are compared based on relatively straightforward rules. At this level, several ``passes'' may be required to determine the order between two strings. Characters may be ignored in some passes, but not in others; the strings may be compared in different directions; and simple string substitutions may be made before strings are compared. This level is best described as ``dictionary'' ordering; it is based on the spelling, not the pronunciation, or meaning, of the words.

(4) Text search ordering.
This is a further refinement of the previous level, best described as ``telephone book ordering''; some common homonyms (words spelled differently but with the same pronunciation) are collated together; numbers are collated as if spelled with words, and so on.

(5) Semantic level ordering. Words and strings are collated based on their meaning; entire words (such as ``the'') are eliminated; the ordering is not deterministic. This usually requires special software, and is highly dependent on the intended use.

While the historical collation order formally is at level 1, for the English language it corresponds roughly to elements at level 2. The user expects to see the output from the ls utility sorted very much as it would be in a dictionary. While telephone book ordering would be an optimal goal for standard collation, this was ruled out as the order would be language dependent. Furthermore, a requirement was that the order must be determined solely from the text string and the collation rules; no external information (e.g., ``pronunciation dictionaries'') could be required.

As a result, the goal for the collation support is at level 3. This also matches the requirements for the proposed Canadian collation order, as well as other, known collation requirements for alphabetic scripts. It specifically rules out collation based on pronunciation rules, or based on semantic analysis of the text.
The syntax for the LC_COLLATE category source is the result of a cooperative effort between representatives for many countries and organizations working with international issues, such as UniForum, X/Open, and ISO. It meets the requirements for level 3, and has been verified to produce the correct result with examples based on French, Canadian, and Danish collation order, as well as meeting the requirements in the X/Open Portability Guide, Issue 3 {B31}. Because it supports multicharacter collating elements, it is also capable of supporting collation in code sets where a character is expressed using nonspacing characters followed by the base character (such as ISO 6937 {B6}).

The directives that can be specified in an operand to the order_start keyword are based on the requirements specified in several proposed standards and in customary use. The following is a rephrasing of rules defined for ``lexical ordering in English and French'' by the Canadian Standards Association (text in brackets is rephrased):

(1) Once special characters ([punctuation]) have been removed from original strings, the ordering is determined by scanning forward (left to right) [disregarding case and diacriticals].

(2) In case of equivalence, special characters are once again removed from original strings and the ordering is determined scanning backward (starting from the rightmost character of the string and back), character by character, [disregarding case but considering diacriticals].

(3) In case of repeated equivalence, special characters are removed again from original strings and the ordering is determined scanning forward, character by character, [considering both case and diacriticals].
(4) If there is still an ordering equivalence after rules (1) through (3) have been applied, then only special characters and the position they occupy in the string are considered to determine ordering. The string that has a special character in the lowest position comes first. If two strings have a special character in the same position, the character [with the lowest collation value] comes first. In case of equality, the other special characters are considered until there is a difference or all special characters have been exhausted.

It is estimated that the standard covers the requirements for all European languages, and no particular problems are anticipated with Slavic or Middle East character sets.

The Far East (particularly Japanese/Chinese) collations are often based on contextual information and pronunciation rules (the same ideogram can have different meanings and different pronunciations). Such collation, in general, falls outside the desired goal of the standard. There are, however, several other collation rules (stroke/radical, or ``most common pronunciation'') which can be supported with the mechanism described here.

Previous drafts contained a substitute statement, which performed a regular expression style replacement before string compares. It has been withdrawn based on balloter objections that it was not required for the types of ordering POSIX.2 is aimed at.

The character (and collating element) order is defined by the order in which characters and elements are specified between the order_start and order_end keywords. This character order is used in range expressions in regular expressions (see 2.8). Weights assigned to the characters and elements define the collation sequence; in the absence of weights, the character order is also the collation sequence.

The position keyword was introduced to provide the capability to consider, in a compare, the relative position of non-IGNOREd characters.
As an example, consider the two strings ``o-ring'' and ``or-ing''. Assuming the hyphen is IGNOREd on the first pass, the two strings will compare equal, and the position of the hyphen is immaterial. On second pass, all characters except the hyphen are IGNOREd, and in the normal case the two strings would again compare equal. By taking position into account, the first collates before the second.

END_RATIONALE

2.5.2.3 LC_MONETARY

     Table 2-8 - LC_MONETARY Category Definition in the POSIX Locale

    LC_MONETARY
    # This is the POSIX Locale definition for
    # the LC_MONETARY category.
    #
    int_curr_symbol      ""
    currency_symbol      ""
    mon_decimal_point    ""
    mon_thousands_sep    ""
    mon_grouping         ""
    positive_sign        ""
    negative_sign        ""
    int_frac_digits      -1
    p_cs_precedes        -1
    p_sep_by_space       -1
    n_cs_precedes        -1
    n_sep_by_space       -1
    p_sign_posn          -1
    n_sign_posn          -1
    #
    END LC_MONETARY

The LC_MONETARY category shall define the rules and symbols that shall be used to format monetary numeric information. The operands are strings. For some keywords, the strings can contain only integers. Keywords that are not provided, string values set to the empty string (""), or integer keywords set to -1, shall be used to indicate that the value is unspecified. The following keywords shall be recognized:

copy                 Specify the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified.

int_curr_symbol      The international currency symbol.
The operand shall be a four-character string, with the first three characters containing the alphabetic international currency symbol in accordance with those specified in ISO 4217 {3} (Codes for the representation of currencies and funds). The fourth character shall be the character used to separate the international currency symbol from the monetary quantity.

currency_symbol      The string that shall be used as the local currency symbol.

mon_decimal_point    The operand is a string containing the symbol that shall be used as the decimal delimiter in monetary formatted quantities. In contexts where other standards limit the mon_decimal_point to a single byte, the result of specifying a multibyte operand is unspecified.

mon_thousands_sep    The operand is a string containing the symbol that shall be used as a separator for groups of digits to the left of the decimal delimiter in formatted monetary quantities. In contexts where other standards limit the mon_thousands_sep to a single byte, the result of specifying a multibyte operand is unspecified.

mon_grouping         Define the size of each group of digits in formatted monetary quantities. The operand is a sequence of integers separated by semicolons. Each integer specifies the number of digits in each group, with the initial integer defining the size of the group immediately preceding the decimal delimiter, and the following integers defining the preceding groups. If the last integer is not -1, then the size of the previous group (if any) shall be repeatedly used for the remainder of the digits. If the last integer is -1, then no further grouping shall be performed.
2 positive_sign A string that shall be used to indicate a nonnegative-valued formatted monetary quantity. negative_sign A string that shall be used to indicate a negative-valued formatted monetary quantity. int_frac_digits An integer representing the number of fractional digits (those to the right of the decimal delimiter) to be written in a formatted monetary quantity using int_curr_symbol. frac_digits An integer representing the number of fractional digits (those to the right of the decimal delimiter) to be written in a formatted monetary quantity using currency_symbol. Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 97 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX p_cs_precedes An integer set to 1 if the currency_symbol or int_curr_symbol precedes the value for a nonnegative formatted monetary quantity, and set to 0 if the symbol succeeds the value. p_sep_by_space An integer set to 0 if no space separates the currency_symbol or int_curr_symbol from the value for a nonnegative formatted monetary quantity, set to 1 if a space separates the symbol from the value, and set to 2 if a space separates the symbol and the sign string, if adjacent. n_cs_precedes An integer set to 1 if the currency_symbol or int_curr_symbol precedes the value for a negative formatted monetary quantity, and set to 0 if the symbol succeeds the value. n_sep_by_space An integer set to 0 if no space separates the currency_symbol or int_curr_symbol from the value for a negative formatted monetary quantity, set to 1 if a space separates the symbol from the value, and set to 2 if a space separates the symbol and the sign string, if adjacent. p_sign_posn An integer set to a value indicating the positioning of the positive_sign for a nonnegative formatted monetary quantity. The following integer values shall be recognized: 0 Parentheses enclose the quantity and the currency_symbol or int_curr_symbol. 
1 The sign string precedes the quantity and the currency_symbol or int_curr_symbol. 2 The sign string succeeds the quantity and the currency_symbol or int_curr_symbol. 3 The sign string immediately precedes the currency_symbol or int_curr_symbol. 4 The sign string immediately succeeds the currency_symbol or int_curr_symbol. n_sign_posn An integer set to a value indicating the positioning of the negative_sign for a negative 1 formatted monetary quantity. The following Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 98 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 integer values shall be recognized: 0 Parentheses enclose the quantity and the currency_symbol or int_curr_symbol. 1 The sign string precedes the quantity and the currency_symbol or int_curr_symbol. 2 The sign string succeeds the quantity and the currency_symbol or int_curr_symbol. 3 The sign string immediately precedes the currency_symbol or int_curr_symbol. 4 The sign string immediately succeeds the currency_symbol or int_curr_symbol. BEGIN_RATIONALE 2.5.2.3.1 LC_MONETARY Rationale. (_T_h_i_s _s_u_b_c_l_a_u_s_e _i_s _n_o_t _a _p_a_r_t _o_f _P_1_0_0_3._2) The currency symbol does not appear in LC_MONETARY because it is not defined in the C Standard's {7} C locale. The C Standard {7} limits the size of decimal points and thousands 2 delimiters to single-byte values. In locales based on multibyte coded 2 character sets this cannot be enforced, obviously; this standard does not 2 prohibit such characters, but makes the behavior unspecified [in the text 2 ``In contexts where other standards ...'']. 2 The grouping specification is based on, but not identical to, the 2 C Standard {7}. The ``-1'' signals that no further grouping shall be 2 performed, the equivalent of {CHAR_MAX} in the C Standard {7}). 2 The locale definition is an extension of the C Standard {7} _l_o_c_a_l_e_c_o_n_v() specification. 
In particular, rules on how currency_symbol is treated are extended to
also cover int_curr_symbol, and p_sep_by_space and n_sep_by_space have
been augmented with the value 2, which places a space between the sign
and the symbol (if they are adjacent; otherwise it should be treated as
a 0). The following table shows the result of various combinations:

                                           p_sep_by_space
                                        2          1          0
p_cs_precedes = 1   p_sign_posn = 0   ($1.25)    ($ 1.25)   ($1.25)
                    p_sign_posn = 1   + $1.25    +$ 1.25    +$1.25
                    p_sign_posn = 2   $1.25 +    $ 1.25+    $1.25+
                    p_sign_posn = 3   + $1.25    +$ 1.25    +$1.25
                    p_sign_posn = 4   $ +1.25    $+ 1.25    $+1.25

p_cs_precedes = 0   p_sign_posn = 0   (1.25 $)   (1.25 $)   (1.25$)
                    p_sign_posn = 1   +1.25 $    +1.25 $    +1.25$
                    p_sign_posn = 2   1.25$ +    1.25 $+    1.25$+
                    p_sign_posn = 3   1.25+ $    1.25 +$    1.25+$
                    p_sign_posn = 4   1.25$ +    1.25 $+    1.25$+

The following is an example of the interpretation of the mon_grouping
keyword. Assuming that the value to be formatted is 123456789 and the
mon_thousands_sep is ', then the following table shows the result. The
third column shows the equivalent C Standard {7} string that would be
used to accommodate this grouping. It is the responsibility of the
utility to perform mappings of the formats in this clause to those used
by language bindings such as the C Standard {7}.

mon_grouping    Formatted Value    C Standard {7} String
____________    _______________    _____________________
3;-1            123456'789         "\3\177"
3               123'456'789        "\3"
3;2;-1          1234'56'789        "\3\2\177"
3;2             12'34'56'789       "\3\2"
-1              123456789          "\177"

In these examples, the octal value of {CHAR_MAX} is 177.

END_RATIONALE

2.5.2.4 LC_NUMERIC

The LC_NUMERIC category shall define the rules and symbols that shall be
used to format nonmonetary numeric information. The operands are strings.
For some keywords, the strings can contain only integers.
Keywords that are not provided, string values set to the empty string 1 (""), or integer keywords set to -1, shall be used to indicate that the 1 value is unspecified. The following keywords shall be recognized: copy Specify the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified. Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 100 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 decimal_point The operand is a string containing the symbol that 2 shall be used as the decimal delimiter in numeric, 2 nonmonetary formatted quantities. This keyword 2 cannot be omitted and cannot be set to the empty 2 string. In contexts where other standards limit 2 the decimal_point to a single byte, the result of 2 specifying a multibyte operand is unspecified. 2 thousands_sep The operand is a string containing the symbol that 2 shall be used as a separator for groups of digits 2 to the left of the decimal delimiter in numeric, 2 nonmonetary formatted monetary quantities. In 2 contexts where other standards limit the 2 thousands_sep to a single byte, the result of 2 specifying a multibyte operand is unspecified. 2 grouping Define the size of each group of digits in formatted nonmonetary quantities. The operand is a sequence of integers separated by semicolons. Each integer specifies the number of digits in each group, with the initial integer defining the size of the group immediately preceding the decimal delimiter, and the following integers defining the preceding groups. If the last integer is not -1, 2 then the size of the previous group (if any) shall 2 be repeatedly used for the remainder of the digits. 2 If the last integer is -1, then no further grouping 2 shall be performed. 
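The grouping rule described above (shared by grouping and mon_grouping)
can be illustrated with a short sketch. This is not part of the draft; it
is a minimal illustration in Python, and the function name group_digits
and its parameters are hypothetical. A size of 0 is given no meaning by
the keyword description, so the sketch simply stops grouping there, which
is an assumption.

```python
def group_digits(digits, grouping, separator):
    """Apply a grouping operand such as [3, 2, -1] to a string of digits.

    `grouping` lists group sizes starting with the group immediately
    preceding the decimal delimiter. A trailing -1 stops further grouping;
    otherwise the last size is reused for the remaining digits.
    """
    groups = []
    remaining = digits
    size = None
    for size in grouping:
        # -1 means "no further grouping". Treating 0 the same way is an
        # assumption of this sketch, not something the text specifies.
        if size < 1 or not remaining:
            break
        groups.append(remaining[-size:])
        remaining = remaining[:-size]
    else:
        # The operand ran out without a -1: repeat the last group size.
        while remaining and size:
            groups.append(remaining[-size:])
            remaining = remaining[:-size]
    if remaining:
        # Whatever is left forms the leftmost, ungrouped run of digits.
        groups.append(remaining)
    return separator.join(reversed(groups))


if __name__ == "__main__":
    for operand in ([3, -1], [3], [3, 2, -1], [3, 2], [-1]):
        print(operand, group_digits("123456789", operand, "'"))
```

The operands in the demonstration loop are the five mon_grouping values
from the table in 2.5.2.3.1, written as Python lists, with ' as the
separator as in that table.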
Table 2-9 - LC_NUMERIC Category Definition in the POSIX Locale
__________________________________________________________________

LC_NUMERIC
# This is the POSIX Locale definition for
# the LC_NUMERIC category.
#
decimal_point    ""
thousands_sep    ""
grouping         0
#
END LC_NUMERIC
__________________________________________________________________

BEGIN_RATIONALE

2.5.2.4.1 LC_NUMERIC Rationale. (This subclause is not a part of P1003.2)

See the rationale for LC_MONETARY (2.5.2.3.1) for a description of the
behavior of grouping.

END_RATIONALE

2.5.2.5 LC_TIME

The LC_TIME category shall define the interpretation of the field
descriptors supported by the date utility (see 4.15).

Table 2-10 - LC_TIME Category Definition in the POSIX Locale
__________________________________________________________________

LC_TIME
# This is the POSIX Locale definition for
# the LC_TIME category.
#
# Abbreviated weekday names (%a)
abday   "";"";"";"";\
        "";"";""
#
# Full weekday names (%A)
day     "";"";\
        "";"";\
        "";"";\
        ""
#
# Abbreviated month names (%b)
abmon   "";"";"";\
        "";"";"";\
        "";"";"";\
        "";"";""
#
# Full month names (%B)
mon     "";"";\
        "";"";\
        "";"";\
        "";"";\
        "";"";\
        "";""
#
# Equivalent of AM/PM (%p)      "AM";"PM"
am_pm   "";""
#
# Appropriate date and time representation (%c)
#       "%a %b %e %H:%M:%S %Y"
d_t_fmt "\
\
"
#
# Appropriate date representation (%x)   "%m/%d/%y"
d_fmt   ""
#
# Appropriate time representation (%X)   "%H:%M:%S"
t_fmt   ""
#
# Appropriate 12-hour time representation (%r)  "%I:%M:%S %p"
t_fmt_ampm "\
"
#
END LC_TIME
__________________________________________________________________

The following mandatory keywords shall be recognized:

copy        Specify the name of an existing locale to be used as the
            source for the definition of this category. If this keyword
            is specified, no other keyword shall be specified.

abday       Define the abbreviated weekday names, corresponding to the
            %a field descriptor. The operand shall consist of seven
            semicolon-separated strings. The first string shall be the
            abbreviated name of the first day of the week (Sunday), the
            second the abbreviated name of the second day, and so on.

day         Define the full weekday names, corresponding to the %A field
            descriptor. The operand shall consist of seven
            semicolon-separated strings. The first string shall be the
            full name of the first day of the week (Sunday), the second
            the full name of the second day, and so on.

abmon       Define the abbreviated month names, corresponding to the %b
            field descriptor. The operand shall consist of twelve
            semicolon-separated strings. The first string shall be the
            abbreviated name of the first month of the year (January),
            the second the abbreviated name of the second month, and so
            on.

mon         Define the full month names, corresponding to the %B field
            descriptor. The operand shall consist of twelve
            semicolon-separated strings. The first string shall be the
            full name of the first month of the year (January), the
            second the full name of the second month, and so on.

d_t_fmt     Define the appropriate date and time representation,
            corresponding to the %c field descriptor. The operand shall
            consist of a string, and can contain any combination of
            characters and field descriptors. In addition, the string
            can contain escape sequences defined in Table 2-15.
1 d_fmt Define the appropriate date representation, corresponding to the %x field descriptor. The operand shall consist of a string, and can contain any combination of characters and field descriptors. In addition, the string can contain escape sequences defined in Table 2-15. 1 t_fmt Define the appropriate time representation, corresponding to the %X field descriptor. The operand shall consist of a string, and can contain any combination of characters and field descriptors. In addition, the string can contain escape sequences defined in Table 2-15. 1 am_pm Define the appropriate representation of the _a_n_t_e _m_e_r_i_d_i_e_m and _p_o_s_t _m_e_r_i_d_i_e_m strings, corresponding to the %p field descriptor. The operand shall consist of two strings, separated by a semicolon. The first string shall represent the _a_n_t_e _m_e_r_i_d_i_e_m designation, the last string the _p_o_s_t _m_e_r_i_d_i_e_m designation. t_fmt_ampm Define the appropriate time representation in the 12-hour clock format with am_pm, corresponding to the %r field descriptor. The operand shall consist of a string and can contain any combination of characters and field descriptors. If the string is empty, the 12-hour format is not supported in the locale. It is implementation defined whether the following optional keywords shall be recognized. If they are not supported, but present in a localedef source, they shall be ignored. era Shall be used to define alternate Eras, corresponding to the %E field descriptor modifier. The format of the operand is unspecified, but shall support the definition of the %EC and %Ey field descriptors, and may also define the era_year format (%EY). era_year Shall be used to define the format of the year in alternate Era format, corresponding to the %EY field descriptor. Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 
era_d_fmt   Shall be used to define the format of the date in alternate
            Era notation, corresponding to the %Ex field descriptor.

alt_digits  Shall be used to define alternate symbols for digits,
            corresponding to the %O field descriptor modifier. The
            operand shall consist of semicolon-separated strings. The
            first string shall be the alternate symbol corresponding
            with zero, the second string the symbol corresponding with
            one, and so on. Up to 100 alternate symbol strings can be
            specified. The %O modifier indicates that the string
            corresponding to the value specified via the field
            descriptor shall be used instead of the value.

BEGIN_RATIONALE

2.5.2.5.1 LC_TIME Rationale. (This subclause is not a part of P1003.2)

Although certain of the field descriptors in the POSIX Locale (such as
the name of the month) are shown with initial capital letters, this need
not be the case in other locales. Programs using these fields may need to
adjust the capitalization if the output is going to be used at the
beginning of a sentence.

The LC_TIME descriptions of abday, day, and abmon imply a Gregorian
style calendar (7-day weeks, 12-month years, leap years, etc.).
Formatting time strings for other types of calendars is outside the scope
of this standard.

As specified under the date command, the field descriptors corresponding
to the optional keywords consist of a modifier followed by a traditional
field descriptor (for instance %Ex). If the optional keywords are not
supported by the implementation or are unspecified for the current
locale, these field descriptors shall be treated as the traditional field
descriptor.
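The fallback just described can be sketched as follows. This is an
illustration only, written in Python, and not part of the draft: the
names format_alt_digit and render_d_fmt are hypothetical, the sketch
handles only the three descriptors %Od, %B, and %Y, and it assumes that
the traditional %d descriptor writes a two-digit day of the month. The
keyword values are those of the alt_digits example given in this
rationale.

```python
# alt_digits operand from the rationale's example: symbols for 0 through 10.
ALT_DIGITS = ["0th", "1st", "2nd", "3rd", "4th", "5th",
              "6th", "7th", "8th", "9th", "10th"]

# d_fmt operand from the same example.
D_FMT = "The %Od day of %B in %Y"


def format_alt_digit(value, alt_digits):
    """Return the %O form of value, falling back to the plain number.

    If alt_digits has no entry for value, the descriptor is treated as
    the traditional one (assumed here to be a two-digit decimal number).
    """
    if 0 <= value < len(alt_digits):
        return alt_digits[value]
    return f"{value:02d}"


def render_d_fmt(fmt, day, month_name, year, alt_digits):
    """Expand only %Od, %B, and %Y; a deliberately tiny subset."""
    out = fmt.replace("%Od", format_alt_digit(day, alt_digits))
    out = out.replace("%B", month_name)
    return out.replace("%Y", str(year))


if __name__ == "__main__":
    print(render_d_fmt(D_FMT, 4, "July", 1776, ALT_DIGITS))
    print(render_d_fmt(D_FMT, 14, "July", 1789, ALT_DIGITS))
```

Because only eleven alternate symbols are defined, day 14 has no entry
and is written as a plain number.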
For instance, assume the following keywords: alt_digits "0th";"1st";"2nd";"3rd";"4th";"5th";\ 1 "6th";"7th";"8th";"9th";"10th" 1 d_fmt "The %Od day of %B in %Y" 1 On 7/4/1776, the %x field descriptor would result in ``The 4th day of 1 July in 1776,'' while 7/14/1789 would come out as ``The 14 day of July in 1789.'' It can be noted that the above example is for illustrative purposes only; the %O modifier is primarily intended to provide for Kanji or Hindi digits in date formats. Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 105 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX While it is clear that an alternate year format is required, there is no consensus on the format or the requirements. As a result, while these keywords are reserved, the details are left unspecified. It is expected that National Standards Bodies will provide specifications. END_RATIONALE 2.5.2.6 LC_MESSAGES The LC_MESSAGES category shall define the format and values for affirmative and negative responses. The operands shall be strings or extended regular expressions; see 2.8.4. The following keywords shall be recognized: copy Specify the name of an existing locale to be used as the source for the definition of this category. If this keyword is specified, no other keyword shall be specified. yesexpr The operand shall consist of an extended regular expression that describes the acceptable affirmative response to a question expecting an affirmative or negative response. noexpr The operand shall consist of an extended regular expression that describes the acceptable negative response to a question expecting an affirmative or negative response. Table 2-11 - LC_MESSAGES Category Definition in the POSIX Locale __________________________________________________________________________________________________________________________________________________ LC_MESSAGES # This is the POSIX Locale definition for # the LC_MESSAGES category. 
# yesexpr "" # noexpr "" END LC_MESSAGES __________________________________________________________________________________________________________________________________________________ BEGIN_RATIONALE 2.5.2.6.1 LC_MESSAGES Rationale. (_T_h_i_s _s_u_b_c_l_a_u_s_e _i_s _n_o_t _a _p_a_r_t _o_f _P_1_0_0_3._2) The LC_MESSAGES category is described in 2.6 as affecting the language used by utilities for their output. The mechanism used by the implementation to accomplish this, other than the responses shown here in the locale definition file, is not specified by this version of this standard. The POSIX.1 working group is developing an interface that Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 106 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 would allow applications (and, presumably some of the standard utilities) to access messages from various message catalogs, tailored to a user's LC_MESSAGES value. END_RATIONALE 2.5.3 Locale Definition Grammar 1 The grammar and lexical conventions in this subclause shall together 1 describe the syntax for the locale definition source. The general 1 conventions for this style of grammar are described in 2.1.2. Any 1 discrepancies found between this grammar and other descriptions in this 1 clause shall be resolved in favor of this grammar. 1 2.5.3.1 Locale Lexical Conventions 1 The lexical conventions for the locale definition grammar are described 1 in this subclause. 1 The following tokens shall be processed (in addition to those string 1 constants shown in the grammar): 1 LOC_NAME A string of characters representing the name of a 1 locale. 1 CHAR Any single character. 1 NUMBER A decimal number, represented by one or more decimal 2 digits. 2 COLLSYMBOL A symbolic name, enclosed between angle brackets. The 1 string shall not duplicate any charmap symbol defined 1 in the current charmap (if any), or a COLLELEMENT 1 symbol. 
1 COLLELEMENT A symbolic name, enclosed between angle brackets, which 1 shall not duplicate either any charmap symbol or a 1 CHARSYMBOL symbol. 1 CHARSYMBOL A symbolic name, enclosed between angle brackets, from 1 the current charmap (if any). 1 OCTAL_CHAR One or more octal representations of the encoding of 1 each byte in a single character. The octal 1 representation consists of an escape_char (normally a 1 backslash) followed by two or more octal digits. 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 107 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX HEX_CHAR One or more hexadecimal representations of the encoding 1 of each byte in a single character. The hexadecimal 1 representation consists of an escape_char followed by 1 the constant 'x' and two or more hexadecimal digits. 1 DECIMAL_CHAR One or more decimal representations of the encoding of 1 each byte in a single character. The decimal 1 representation consists of an escape_char and followed 1 by a 'd' and two or more decimal digits. 1 ELLIPSIS The string ``...''. 1 2 EXTENDED_REG_EXP 1 An extended regular expression as defined in the 1 grammar in 2.8.5.2. 1 2 EOL The line termination character . 1 2.5.3.2 Locale Grammar 1 This subclause presents the grammar for the locale definition. 1 %token LOC_NAME 1 %token CHAR 1 %token NUMBER 2 %token COLLSYMBOL COLLELEMENT 1 %token CHARSYMBOL OCTAL_CHAR HEX_CHAR DECIMAL_CHAR 1 %token ELLIPSIS 1 %token EXTENDED_REG_EXP 2 %token EOL 1 %start locale_definition 1 %% 1 locale_definition : global_statements locale_categories 2 | locale_categories 2 ; 1 global_statements : global_statements symbol_redefine 2 | symbol_redefine 2 ; 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 
108 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 symbol_redefine : '#escape_char' CHAR EOL 1 | '#comment_char' CHAR EOL 1 ; 1 locale_categories : locale_categories locale_category 2 | locale_category 2 ; 1 locale_category : lc_ctype | lc_collate | lc_messages 1 | lc_monetary | lc_numeric | lc_time 1 ; 1 /* The following grammar rules are common to all categories */ 1 char_list : char_list char_symbol 2 | char_symbol 2 ; 1 char_symbol : CHAR | CHARSYMBOL 1 | OCTAL_CHAR | HEX_CHAR | DECIMAL_CHAR 1 ; 1 locale_name : LOC_NAME 1 | '"' LOC_NAME '"' 1 ; 1 /* The following is the LC_CTYPE category grammar */ 1 lc_ctype : ctype_hdr ctype_keywords ctype_tlr 2 | ctype_hdr 'copy' locale_name EOL ctype_tlr 2 ; 2 ctype_hdr : 'LC_CTYPE' EOL 2 ; 2 ctype_keywords : ctype_keywords ctype_keyword 2 | ctype_keyword 2 ; 1 ctype_keyword : charclass_keyword charclass_list EOL 1 | charconv_keyword charconv_list EOL 1 ; 1 charclass_keyword : 'upper' | 'lower' | 'alpha' | 'digit' 1 | 'alnum' | 'xdigit' | 'space' | 'print' 1 | 'graph' | 'blank' | 'cntrl' 1 ; 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 
2.5 Locale 109 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX charclass_list : charclass_list ';' char_symbol 2 | charclass_list ';' ELLIPSIS ';' char_symbol 1 | char_symbol 2 ; 1 charconv_keyword : 'toupper' 1 | 'tolower' 1 ; 1 charconv_list : charconv_list ';' charconv_entry 2 | charconv_entry 2 ; 1 charconv_entry : '(' char_symbol ',' char_symbol ')' 1 ; 1 ctype_tlr : 'END' 'LC_CTYPE' EOL 2 ; 1 /* The following is the LC_COLLATE category grammar */ 1 lc_collate : collate_hdr collate_keywords collate_tlr 2 | collate_hdr 'copy' locale_name EOL collate_tlr 2 ; 2 collate_hdr : 'LC_COLLATE' EOL 2 ; 2 collate_keywords : order_statements 2 | opt_statements order_statements 2 ; 1 opt_statements : opt_statements collating_symbols 2 | opt_statements collating_elements 2 | collating_symbols 1 | collating_elements 1 ; 1 collating_symbols : 'collating-symbol' COLLSYMBOL EOL 1 ; 1 collating_elements : 'collating-element' COLLELEMENT 1 'from' '"' char_list '"' EOL 2 ; 1 2 order_statements : order_start collation_order order_end 1 ; 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 
110 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 order_start : 'order_start' EOL 1 | 'order_start' order_opts EOL 1 ; 1 order_opts : order_opts ';' order_opt 2 | order_opt 2 ; 1 order_opt : order_opt ',' opt_word 2 | opt_word 2 ; 1 opt_word : 'forward' | 'backward' | 'position' 2 ; 1 collation_order : collation_order collation_entry 2 | collation_entry 2 ; 1 collation_entry : COLLSYMBOL EOL 1 | collation_element weight_list EOL 1 | collation_element EOL 2 ; 1 collation_element : char_symbol 1 | COLLELEMENT 1 | ELLIPSIS 1 | 'UNDEFINED' 1 ; 1 weight_list : weight_list ';' weight_symbol 2 | weight_list ';' 2 | weight_symbol 2 ; 1 weight_symbol : char_symbol 2 | COLLSYMBOL 1 | '"' char_list '"' 1 | ELLIPSIS 1 | 'IGNORE' 1 ; 1 order_end : 'order_end' EOL 1 ; 1 collate_tlr : 'END' 'LC_COLLATE' EOL 2 ; 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 2.5 Locale 111 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX /* The following is the LC_MESSAGES category grammar */ 1 lc_messages : messages_hdr messages_keywords messages_tlr 2 | messages_hdr 'copy' locale_name EOL messages_tlr 2 ; 2 messages_hdr : 'LC_MESSAGES' EOL 2 ; 2 messages_keywords : messages_keywords messages_keyword 2 | messages_keyword 2 ; 1 messages_keyword : 'yesexpr' '"' EXTENDED_REG_EXP '"' EOL 2 | 'noexpr' '"' EXTENDED_REG_EXP '"' EOL 2 ; 2 messages_tlr : 'END' 'LC_MESSAGES' EOL 2 ; 1 /* The following is the LC_MONETARY category grammar */ 1 lc_monetary : monetary_hdr monetary_keywords monetary_tlr2 | monetary_hdr 'copy' locale_name EOL monetary_tlr2 ; 2 monetary_hdr : 'LC_MONETARY' EOL 2 ; 2 monetary_keywords : monetary_keywords monetary_keyword 2 | monetary_keyword 2 ; 1 monetary_keyword : mon_keyword_string mon_string EOL 1 | mon_keyword_char NUMBER EOL 2 | mon_keyword_char '-1' EOL 2 | mon_keyword_grouping mon_group_list EOL 1 ; 1 mon_keyword_string : 'int_curr_symbol' | 'currency_symbol' 1 | 'mon_decimal_point' 
| 'mon_thousands_sep' 1 | 'positive_sign' | 'negative_sign' 1 ; 1 mon_string : '"' char_list '"' 1 | '""' 1 ; 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 112 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 mon_keyword_char : 'int_frac_digits' | 'frac_digits' 1 | 'p_cs_precedes' | 'p_sep_by_space' 1 | 'n_cs_precedes' | 'n_sep_by_space' 1 | 'p_sign_posn' | 'n_sign_posn' 1 ; 1 2 mon_keyword_grouping : 'mon_grouping' 1 ; 1 mon_group_list : NUMBER 2 | mon_group_list ';' NUMBER 2 ; 2 monetary_tlr : 'END' 'LC_MONETARY' EOL 2 ; 2 /* The following is the LC_NUMERIC category grammar */ 2 lc_numeric : numeric_hdr numeric_keywords numeric_tlr 2 | numeric_hdr 'copy' locale_name EOL numeric_tlr 2 ; 2 numeric_hdr : 'LC_NUMERIC' EOL 2 ; 2 numeric_keywords : numeric_keywords numeric_keyword 2 | numeric_keyword 2 ; 1 numeric_keyword : num_keyword_string num_string EOL 1 | num_keyword_grouping num_group_list EOL 1 ; 1 num_keyword_string : 'decimal_point' 1 | 'thousands_sep' 1 ; 1 num_string : '"' char_list '"' 1 | '""' 1 ; 1 num_keyword_grouping : 'num_grouping' 1 ; 1 num_group_list : NUMBER 2 | num_group_list ';' NUMBER 2 ; 1 2 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 
2.5 Locale 113 P1003.2/D11.2 INFORMATION TECHNOLOGY--POSIX numeric_tlr : 'END' 'LC_NUMERIC' EOL 2 ; 1 /* The following is the LC_TIME category grammar */ 1 lc_time : time_hdr time_keywords time_tlr 2 | time_hdr 'copy' locale_name EOL time_tlr 2 ; 1 time_hdr : 'LC_TIME' EOL 2 ; 1 time_keywords : time_keywords time_keyword 2 | time_keyword 2 ; 1 time_keyword : time_keyword_name time_list EOL 2 | time_keyword_fmt time_string EOL 1 | time_keyword_opt time_list EOL 1 ; 1 time_keyword_name : 'abday' | 'day' | 'abmon' | 'mon' 2 ; 1 time_keyword_fmt : 'd_t_fmt' | 'd_fmt' | 't_fmt' | 'am_pm' | 't_fmt_ampm'1 ; 1 time_keyword_opt : 'era' | 'era_year' | 'era_d_fmt' | 'alt_digits' 1 ; 1 time_list : time_list ';' time_string 2 | time_string 2 ; 1 time_string : '"' char_list '"' 1 ; 1 time_tlr : 'END' 'LC_TIME' EOL 2 ; 1 BEGIN_RATIONALE 1 Copyright c 1991 IEEE. All rights reserved. This is an unapproved IEEE Standards Draft, subject to change. 114 2 Terminology and General Requirements Part 2: SHELL AND UTILITIES P1003.2/D11.2 2.5.4 Locale Definition Example. (_T_h_i_s _s_u_b_c_l_a_u_s_e _i_s _n_o_t _a _p_a_r_t _o_f _P_1_0_0_3._2) The following is an example of a locale definition file that could be used as input to the localedef utility. It assumes that the utility is executed with the -f option, naming a _c_h_a_r_m_a_p file with (at least) the following content: CHARMAP \x20 \x24 \101 \141 \346 \365 \300 1 \366 \142 \103 \143 \347 \x64 \110 \150 \xb7 \x73 \x7a END CHARMAP It should not be taken as complete or to represent any actual locale, but only to illustrate the syntax. A further set of examples is offered as part of Annex F. 
#
LC_CTYPE
lower   ;;;;;...;
upper   A;B;C;C,;...;Z
space   \x20;\x09;\x0a;\x0b;\x0c;\x0d
blank   \040;\011
toupper (,);(b,B);(c,C);(c,,C,);(d,D);(z,Z)
END LC_CTYPE
#
LC_COLLATE
#
# The following example of collation is based on the proposed
# Canadian standard Z243.4.1-1990, "Canadian Alphanumeric
# Ordering Standard For Character sets of CSA Z234.4 Standard".
# (Other parts of this example locale definition file do not
# purport to relate to Canada, or to any other real culture.)
# The proposed standard defines a 4-weight collation, such that
# in the first pass, characters are compared without regard to
# case or accents; in the second pass, backwards compare without
# regard to case; in the third pass, forward compare without
# regard to diacriticals. In the first three passes, non-alphabetic
# characters are ignored; in the fourth pass, only special
# characters are considered, such that "The string that has a
# special character in the lowest position comes first. If two
# strings have a special character in the same position, the
# collation value of the special character determines ordering."
#
# Only a subset of the character set is used here; mostly to
# illustrate the set-up.
#
#
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
collating-symbol
# Further collating-symbols follow.
#
# Properly, the standard does not include any multi-character
# collating elements; the one below is added for completeness.
#
collating-element from
collating-element from
collating-element from
#
order_start forward;backward;forward;forward,position
#
# Collating symbols are specified first in the sequence to allocate
# basic collation values to them, lower than that of any character.
# Further collating symbols are given a basic collating value here.
#
# Here follow special characters.
IGNORE;IGNORE;IGNORE;
# Other special characters follow here.
#
# Here come the regular characters.
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
;;;IGNORE
#
# As an example, the strings "Bach" and "bach" could be encoded (for
# compare purposes) as:
# "Bach" ;;;;;;\
#        ;;;;;
# "bach" ;;;;;;\
#        ;;;;;
#
# The two strings are equal in passes 1 and 2, but differ in pass 3.
#
# Further characters follow.
#
UNDEFINED IGNORE;IGNORE;IGNORE;IGNORE
#
order_end
#
END LC_COLLATE
#
LC_MONETARY
int_curr_symbol    "USD "
currency_symbol    "$"
mon_decimal_point  "."
mon_grouping       3;0
positive_sign      ""
negative_sign      "-"
p_cs_precedes      1
n_sign_posn        0
END LC_MONETARY
#
LC_NUMERIC
copy "US_en.ASCII"
END LC_NUMERIC
#
LC_TIME
abday   "Sun";"Mon";"Tue";"Wed";"Thu";"Fri";"Sat"
#
day     "Sunday";"Monday";"Tuesday";"Wednesday";\
        "Thursday";"Friday";"Saturday"
#
abmon   "Jan";"Feb";"Mar";"Apr";"May";"Jun";\
        "Jul";"Aug";"Sep";"Oct";"Nov";"Dec"
#
mon     "January";"February";"March";"April";\
        "May";"June";"July";"August";"September";\
        "October";"November";"December"
#
d_t_fmt "%a %b %d %T %Z %Y\n"
END LC_TIME
#
LC_MESSAGES
yesexpr "^([yY][[:alpha:]]*)|(OK)"
#
noexpr  "^[nN][[:alpha:]]*"
END LC_MESSAGES

END_RATIONALE
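The yesexpr and noexpr values in the example above are extended regular
expressions that a utility would match against a user's response. The
following sketch, which is not part of the draft, shows one way such
matching might look in Python. The function name classify_response is
hypothetical. Python's re module does not support the [[:alpha:]]
bracket class, so it is approximated here by [A-Za-z], an assumption
that holds only for ASCII responses.

```python
import re

# yesexpr "^([yY][[:alpha:]]*)|(OK)" with [[:alpha:]] approximated by
# [A-Za-z]. The alternation binds loosest, so "OK" is not anchored.
YESEXPR = re.compile(r"^([yY][A-Za-z]*)|(OK)")

# noexpr "^[nN][[:alpha:]]*" with the same approximation.
NOEXPR = re.compile(r"^[nN][A-Za-z]*")


def classify_response(text):
    """Classify a response as 'yes', 'no', or 'unknown'.

    Checking yesexpr before noexpr is a choice made by this sketch; the
    example locale does not say which expression takes priority.
    """
    if YESEXPR.search(text):
        return "yes"
    if NOEXPR.search(text):
        return "no"
    return "unknown"


if __name__ == "__main__":
    for reply in ("yes", "Y", "OK", "no", "Nope", "maybe"):
        print(reply, "->", classify_response(reply))
```

Because the example's yesexpr anchors only its first alternative, a
response that merely contains OK also counts as affirmative.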
2.6  Environment Variables

Environment variables defined in this clause affect the operation of multiple utilities and applications. There are other environment variables that are of interest only to specific utilities. Environment variables that apply to a single utility only are defined as part of the utility description. See the Environment Variables subclause of the utility descriptions for information on environment variable usage.

The value of an environment variable is a string of characters, as described in 2.7 in POSIX.1 {8}.

Environment variable names used by the standard utilities shall consist solely of uppercase letters, digits, and the _ (underscore) from the characters defined in 2.4. The namespace of environment variable names containing lowercase letters shall be reserved for applications. Applications can define any environment variables with names from this namespace without modifying the behavior of the standard utilities.

If the following variables are present in the environment during the execution of an application or utility, they are given the meaning described below. They may be put into the environment, or changed, by either the implementation or the user. If they are defined in the utility's environment, the standard utilities assume they have the specified meaning. Conforming applications shall not set these environment variables to have meanings other than as described. See 7.2 and 3.12 for methods of accessing these variables.

HOME        A pathname of the user's home directory.

LANG        This variable shall determine the locale category for any category not specifically selected via a variable starting with LC_. LANG and the LC_ variables can be used by applications to determine the language for messages and instructions, collating sequences, date formats, etc. Additional semantics of this variable, if any, are implementation defined.

LC_ALL      This variable shall override the value of the LANG variable and the value of any of the other variables starting with LC_.

LC_COLLATE  This variable shall determine the locale category for character collation information within bracketed regular expressions and for sorting. This environment variable determines the behavior of ranges, equivalence classes, and multicharacter collating elements. Additional semantics of this variable, if any, are implementation defined.

LC_CTYPE    This variable shall determine the locale category for character handling functions. This environment variable shall determine the interpretation of sequences of bytes of text data as characters (e.g., single- versus multibyte characters), the classification of characters (e.g., alpha, digit, graph), and the behavior of character classes. Additional semantics of this variable, if any, are implementation defined.

LC_MESSAGES This variable shall determine the locale category for processing affirmative and negative responses and the language and cultural conventions in which messages should be written. Additional semantics of this variable, if any, are implementation defined. The language and cultural conventions of diagnostic and informative messages whose format is unspecified by this standard should be affected by the setting of LC_MESSAGES.

LC_MONETARY This variable shall determine the locale category for monetary-related numeric formatting information. Additional semantics of this variable, if any, are implementation defined.

LC_NUMERIC  This variable shall determine the locale category for numeric formatting (for example, thousands separator and radix character) information. Additional semantics of this variable, if any, are implementation defined.

LC_TIME     This variable shall determine the locale category for date and time formatting information. Additional semantics of this variable, if any, are implementation defined.

LOGNAME     The user's login name.

PATH        The sequence of path prefixes that certain functions and utilities apply in searching for an executable file known only by a filename. The prefixes shall be separated by a colon (:). When a nonzero-length prefix is applied to this filename, a slash shall be inserted between the prefix and the filename. A zero-length prefix is an obsolescent feature that indicates the current working directory. It appears as two adjacent colons (::), as an initial colon preceding the rest of the list, or as a trailing colon following the rest of the list. A Strictly Conforming POSIX.2 Application shall use an actual pathname (such as '.') to represent the current working directory in PATH. The list shall be searched from beginning to end, applying the filename to each prefix, until an executable file with the specified name and appropriate execution permissions is found. If the pathname being sought contains a slash, the search through the path prefixes shall not be performed. If the pathname begins with a slash, the specified path shall be resolved as described in 2.2.2.104. If PATH is unset or is set to null, the path search is implementation-defined.

SHELL       A pathname of the user's preferred command language interpreter. If this interpreter does not conform to the shell command language in Section 3, utilities may behave differently than described in this standard.

TMPDIR      A pathname of a directory made available for programs that need a place to create temporary files.

TERM        The terminal type for which output is to be prepared.
            This information is used by utilities and application programs wishing to exploit special capabilities specific to a terminal. The format and allowable values of this environment variable are unspecified.

TZ          Time-zone information. The format is described in POSIX.1 {8} 8.1.1.

The environment variables LANG, LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME (LC_*) provide for the support of internationalized applications. The standard utilities shall make use of these environment variables as described in this clause and the individual Environment Variables subclauses for the utilities. If these variables specify locale categories that are not based upon the same underlying code set, the results are unspecified.

For utilities used in internationalized applications, if LC_ALL is not set in the environment or is set to the empty string, and if any of the LC_* variables is not set in the environment or is set to the empty string, the operational behavior of the utility for the corresponding locale category shall be determined by the setting of the LANG environment variable. If the LANG environment variable is not set or is set to the empty string, the implementation-defined default locale shall be used.

If LANG (or any of the LC_* environment variables) contains the value "C", or the value "POSIX", the POSIX Locale shall be selected and the standard utilities shall behave in accordance with the rules in 2.5.1 for the associated category.

If LANG (or any of the LC_* environment variables) begins with a slash, it shall be interpreted as the pathname of a file that was created in the output format used by the localedef utility; see 4.35.6.3. Referencing such a pathname shall result in that locale being used for the category indicated.
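
As an informal illustration of the precedence stated above (LC_ALL first, then the category's own LC_ variable, then LANG, then the implementation-defined default), the lookup can be sketched in the shell command language. The function name effective_locale and the placeholder string implementation-default are assumptions made for this sketch; neither is defined by this standard, and the sketch does not check whether the resulting name is a valid locale.

```shell
# effective_locale CATEGORY
# Print the locale name selected for CATEGORY (e.g. LC_TIME) by the
# precedence LC_ALL, then CATEGORY itself, then LANG.  A variable that
# is unset or set to the empty string is skipped.
effective_locale() {
    case $1 in
        LC_COLLATE|LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME) ;;
        *) return 2 ;;      # not one of the categories listed in 2.6
    esac
    # Indirect lookup of the variable named by $1; safe because $1 was
    # checked against the fixed list above.
    eval "category_value=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Placeholder invented for this sketch; the actual default
        # locale is implementation defined.
        printf '%s\n' implementation-default
    fi
}
```

For instance, with LC_ALL unset, LC_TIME set to POSIX, and LANG set to C, the call effective_locale LC_TIME prints POSIX; setting LC_ALL to any nonempty value overrides both.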

If LANG (or any of the LC_* environment variables) contains one of a set of implementation-defined values, the standard utilities shall behave in accordance with the rules in a corresponding implementation-defined locale description for the associated category.

If LANG (or any of the LC_* environment variables) contains a value that the implementation does not recognize, the behavior is unspecified.

Additional criteria for determining a valid locale name are implementation defined.

BEGIN_RATIONALE

2.6.1  Environment Variables Rationale. (This subclause is not a part of P1003.2)

The standard is worded so that the specified variables may be provided to the application. There is no way that the implementation can guarantee that a utility will ever see an environment variable, as a parent process can change the environment for its children. The env -i command in this standard and the POSIX.1 {8} exec family both offer ways to remove any of these variables from the environment.

The language about locale implies that any utilities written in Standard C and conforming to POSIX.2 must issue the following call:

      setlocale(LC_ALL, "")

If this were omitted, the C Standard {7} specifies that the C Locale would be used.

If any of the environment variables is invalid, it makes sense to default to an implementation-defined, consistent locale environment. It is more confusing for a user to have partial settings occur in case of a mistake. All utilities would then behave in one language/cultural environment. Furthermore, it provides a way of forcing the whole environment to be the implementation-defined default. Disastrous results could occur if a pipeline of utilities partially uses the environment variables in different ways.
In this case, it would be appropriate for utilities that use LANG and related variables to exit with an error if any of the variables are invalid. On the other hand, users typing individual commands at a terminal might want date to work if LC_MONETARY is invalid as long as LC_TIME is valid. Since these are conflicting reasonable alternatives, POSIX.2 leaves the results unspecified if the locale environment variables would not produce a complete locale matching the user's specification.

The locale settings of individual categories cannot be truly independent and still guarantee correct results. For example, when collating two strings, characters must first be extracted from each string (governed by LC_CTYPE) before being mapped to collating elements (governed by LC_COLLATE) for comparison. That is, if LC_CTYPE is causing parsing according to the rules of a large, multibyte code set (potentially returning 20000 or more distinct character code set values), but LC_COLLATE is set to handle only an 8-bit code set with 256 distinct characters, meaningful results are obviously impossible.

The LC_MESSAGES variable affects the language of messages generated by the standard utilities. This standard does not provide a means whereby applications can easily be written to perform similar feats. Future versions of POSIX.1 {8} and POSIX.2 are expected to provide both functions and utilities to accomplish multilanguage messaging (using message catalogs), but such facilities were not ready for standardization at the time the initial versions of the standards were developed.

This clause is not a full list of all environment variables, but only those of importance to multiple utilities.
Nevertheless, to satisfy some members of the balloting group, here is a list of the other environment variable symbols mentioned in this standard:

      Variable   Utility      Variable    Utility
      ________   _______      _________   _______
      CDPATH     cd           MAKEFLAGS   make
      COLUMNS    ls           OPTARG      getopts
      DEAD       mailx        OPTIND      getopts
      IFS        sh           PRINTER     lp
      LPDEST     lp           PS1         sh
      MAIL       sh           PS2         sh
      MAILRC     mailx

The description of PATH is similar to that in POSIX.1 {8}, except:

  - The behavior of a null prefix is marked obsolescent in favor of using a real pathname. This was done at the behest of some members of the balloting group, who apparently felt it offered a more secure environment, where the current directory would not be selected unintentionally.

  - The POSIX.1 {8} exec description requires an implementation-defined path search when PATH is ``not present.'' POSIX.2 spells out that this means ``unset or set to null.'' Many implementations historically have used a default value of /bin and /usr/bin. POSIX.2 does not mandate that this default path be identical to that retrieved from getconf _CS_PATH because it is likely that a transition to POSIX.2 conformance will see the newly-standardized utilities in another directory that needs to be isolated from some historical applications.

  - The POSIX.1 {8} PATH description is ambiguous about whether an ``executable file'' means one that has the appropriate permissions for the searching process to execute it. One reading would say that a file with any of the execution bits set would satisfy the search and that an [EACCES] could be returned at that point. This is not the way historical systems work and POSIX.2 has clarified it to mean that the path search will continue until it finds the name with the execute permissions that would allow the process to execute it.
    (The case of the [ENOEXEC] error is handled in the text of 3.9.1.1.)

The terminology ``beginning to end'' is used in PATH to avoid the noninternationalized ``left to right.''

There is no way to have a colon character embedded within a pathname that is part of the PATH variable string. Colon is not a member of the portable filename character set, so this should not be a problem.

A portable application can retrieve a default PATH value (that will allow access to all the standard utilities) from the system using the command:

      getconf _CS_PATH

See the rationale for the command utility for an example of using this.

The SHELL variable names the user's preferred shell; it is a guide to applications. There is no direct requirement that that shell conform to this standard--that decision should rest with the user. It is the intention of the developers of this standard that alternative shells be permitted, if the user chooses to develop or acquire one. An operating system that builds its shell into the ``kernel'' in such a manner that alternative shells would be impossible does not conform to the spirit of the standard.

The following environment variables are not currently used by the standard utilities (although they may be by future UPE utilities). Implementations should reserve the names for the following purposes:

EDITOR      The name of the user's preferred text file editor. The value of this variable is the name of a utility: either a pathname containing a slash, or a filename to be located using the PATH environment variable.

VISUAL      The name of the user's preferred ``visual,'' or full-screen, text file editor. The value of this variable is the name of a utility: either a pathname containing a slash, or a filename to be located using the PATH environment variable.
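
The rule shared by PATH, EDITOR, and VISUAL (a name containing a slash is used directly; any other name is applied to each PATH prefix from beginning to end until an executable file is found) can be sketched as a shell function. This is an illustration under stated assumptions, not the search any implementation is required to use: the name find_in_path and its optional second argument are invented for the example, and because the search for an unset or null PATH is implementation-defined, the sketch simply fails in that case.

```shell
# find_in_path NAME [PREFIX_LIST]
# Print the first pathname under which NAME is an executable regular
# file.  PREFIX_LIST defaults to the value of PATH.
find_in_path() {
    name=$1
    search_path=${2-${PATH-}}
    case $name in
        */*)    # Contains a slash: no search through the prefixes.
            if [ -f "$name" ] && [ -x "$name" ]; then
                printf '%s\n' "$name"
                return 0
            fi
            return 1 ;;
    esac
    # Unset or null PATH: implementation-defined; this sketch gives up.
    [ -n "$search_path" ] || return 1
    rest=$search_path:          # sentinel colon ends the last prefix
    while [ -n "$rest" ]; do
        prefix=${rest%%:*}      # text before the first colon
        rest=${rest#*:}         # text after the first colon
        if [ -z "$prefix" ]; then
            candidate=./$name   # zero-length prefix (obsolescent)
        else
            candidate=$prefix/$name
        fi
        # Keep searching past files that lack execute permission.
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

The value printed by the getconf command shown earlier could be passed as the second argument to restrict the search to the directories holding the standard utilities.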

The decision to restrict conforming systems to the use of digits, uppercase letters, and underscores for environment variable names allows applications to use lowercase letters in their environment variable names without conflicting with any conforming system.

PROCLANG was added to an earlier draft for internationalized applications, but was removed from the standard because the working group determined that it was not of use.

USER was removed from an earlier draft because it was an unreasonable duplication of LOGNAME.

END_RATIONALE

2.7  Required Files

The following directories shall exist on conforming systems and shall be used as described. Strictly Conforming POSIX.2 Applications shall not assume the ability to create files in any of these directories.

/           The root directory.

/dev        Contains /dev/null and /dev/tty, described below.

The following directory shall exist on conforming systems and shall be used as described.

/tmp        A directory made available for programs that need a place to create temporary files. Applications shall be allowed to create files in this directory, but shall not assume that such files are preserved between invocations of the application.

The following files shall exist on conforming systems and shall be both readable and writable.

/dev/null   An infinite data source/sink. Data written to /dev/null is discarded. Reads from /dev/null always return end-of-file (EOF).

/dev/tty    In each process, a synonym for the controlling terminal associated with the process group of that process, if any. It is useful for programs or shell procedures that wish to be sure of writing messages to or reading data from the terminal no matter how output has been redirected.

BEGIN_RATIONALE

2.7.1  Required Files Rationale.
(This subclause is not a part of P1003.2)

A description of the historical /usr/tmp was omitted, removing any concept of differences in emphasis between the / and /usr versions. The descriptions of /bin, /usr/bin, /lib, and /usr/lib were omitted because they are not useful for applications. In an early draft, a distinction was made between system and application directory usage, but this was not found to be useful.

In Draft 8, /, /dev, /local, /usr/local, and /usr/man were removed. The directories / and /dev were restored in Draft 9. It was pointed out by several balloters that the notion of a hierarchical directory structure is key to other information presented in later sections of the standard. (Previously, some had argued that special devices and temporary files could conceivably be handled without a directory structure on some implementations. For example, the system could treat the characters ``/tmp'' as a special token that would store files using some non-POSIX file system structure. This notion was rejected by the working group, which requires that all the files in this clause be implemented via POSIX file systems.)

The /tmp directory is retained in the standard to accommodate historical applications that assume its availability. Future implementations are encouraged to provide suitable directory names in TMPDIR and future applications are encouraged to use the contents of TMPDIR for creating temporary files.

The standard files /dev/null and /dev/tty are required to be both readable and writable to allow applications to have the intended historical access to these files.

END_RATIONALE
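
The encouragement above to take the temporary directory from TMPDIR, with /tmp as the directory that is always available, might be followed in a shell procedure roughly as shown here. The function names temp_dir and make_scratch_file and the scratch.PID naming scheme are assumptions made for this sketch; the noclobber option (set -C) is used so that an existing file is never overwritten, and diagnostics from the failed creation are discarded by writing them to /dev/null.

```shell
# temp_dir
# Print the directory to use for temporary files: the value of TMPDIR
# if it is set and not null, otherwise /tmp.
temp_dir() {
    printf '%s\n' "${TMPDIR:-/tmp}"
}

# make_scratch_file
# Create an empty scratch file named after the shell's process ID and
# print its pathname.  Fails if a file of that name already exists.
make_scratch_file() {
    scratch=$(temp_dir)/scratch.$$
    # set -C (noclobber) makes the redirection fail rather than
    # truncate an existing file; the subshell confines the option.
    ( set -C; : > "$scratch" ) 2>/dev/null || return 1
    printf '%s\n' "$scratch"
}
```

A caller would typically remove the file when finished, since files in the temporary directory are not guaranteed to be preserved (or cleaned up) between invocations.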

2.8  Regular Expression Notation

Editor's Note: The entire rationale for this clause appears at the end of the clause.

Regular Expressions (REs) provide a mechanism to select specific strings from a set of character strings. Regular expressions are a context-independent syntax that can represent a wide variety of character sets and character set orderings, where these character sets are interpreted according to the current locale. While many regular expressions can be interpreted differently depending on the current locale, many features, such as character class expressions, provide for contextual invariance across locales.

The Basic Regular Expression (BRE) notation and construction rules in 2.8.3 shall apply to most utilities supporting regular expressions. Some utilities, instead, support the Extended Regular Expressions (ERE) described in 2.8.4; any exceptions for both cases are noted in the descriptions of the specific utilities using regular expressions. Both BREs and EREs are supported by the Regular Expression Matching interface in 7.3.

2.8.1  Regular Expression Definitions

For the purposes of this clause, the following definitions apply.

2.8.1.1  entire regular expression: The concatenated set of one or more BREs or EREs that make up the pattern specified for string selection.

2.8.1.2  matched: A sequence of zero or more characters is said to be matched by a BRE or ERE when the characters in the sequence correspond to a sequence of characters defined by the pattern.

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character.

The search for a matching sequence shall start at the beginning of a string and stop when the first sequence matching the expression is found, where ``first'' is defined to mean ``begins earliest in the string.'' If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence shall be matched. For example: the BRE bb* matches the second through fourth characters of abbbc, and the ERE (wee|week)(knights|night) matches all ten characters of weeknights.

Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, shall match the longest possible string. For this purpose, a null string shall be considered to be longer than no match at all. For example, matching the BRE \(.*\).* against abcdef, the subexpression (\1) is abcdef, and matching the BRE \(a*\)* against bc, the subexpression (\1) is the null string.

When a multicharacter collating element in a bracket expression (see 2.8.3.2) is involved, the longest sequence shall be measured in characters consumed from the string to be matched; i.e., the collating element counts not as one element, but as the number of characters it matches.

2.8.1.3  BRE [ERE] matching a single character: A BRE or ERE that matches either a single character or a single collating element.

Only a BRE or ERE of this type that includes a bracket expression (see 2.8.3.2) can match a collating element.

2.8.1.4  BRE [ERE] matching multiple characters: A BRE or ERE that matches a concatenation of single characters or collating elements.
Such a BRE or ERE is made up from a BRE (ERE) matching a single character and BRE (ERE) special characters.

2.8.2  Regular Expression General Requirements

The requirements in this subclause shall apply to both basic and extended regular expressions.

The use of regular expressions is generally associated with text processing; i.e., REs (BREs and EREs) operate on text strings; i.e., zero or more characters followed by an end-of-string delimiter (typically NUL). Some utilities employing regular expressions limit the processing to lines; i.e., zero or more characters followed by a <newline>. In the regular expression processing described in this standard, the <newline> character is regarded as an ordinary character. This standard specifies within the individual descriptions of those standard utilities employing regular expressions whether they permit matching of <newline>s; if not stated otherwise, the use of literal <newline>s or any escape sequence equivalent produces undefined results.

The interfaces specified in this standard do not permit the inclusion of a NUL character in an RE or in the string to be matched. If during the operation of a standard utility a NUL is included in the text designated to be matched, that NUL may designate the end of the text string for the purposes of matching.

When a standard utility or function that uses regular expressions specifies that pattern matching shall be performed without regard to the case (upper- or lower-) of either data or patterns, then when each character in the string is matched against the pattern, not only the character, but also its case counterpart (if any), shall be matched.

The implementation shall support any regular expression that does not exceed 256 bytes in length.
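
The ``begins earliest, then longest'' rule of 2.8.1.2 and the case-insensitive matching rule above can be observed informally with the sed and grep utilities, both of which accept BREs. The function names below are invented for this sketch, and the first function assumes that the BRE passed to it contains no slash, since slash is used as the delimiter of the sed substitute command.

```shell
# first_match_marked BRE STRING
# Print STRING with its first match of BRE enclosed in [ and ].
# The match chosen is the one that begins earliest and, among those,
# the longest.  BRE must not contain a slash.
first_match_marked() {
    printf '%s\n' "$2" | sed "s/$1/[&]/"
}

# count_matches_ignoring_case BRE
# Print the number of standard input lines that match BRE when the
# case of letters is disregarded.
count_matches_ignoring_case() {
    grep -i -c -e "$1"
}
```

With these definitions, first_match_marked 'bb*' abbbc prints a[bbb]c (the second through fourth characters), whereas first_match_marked 'b*' abbbc prints []abbbc, because the null string at the start of the line begins earlier than the run of b characters.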

This clause uses the term ``invalid'' for certain constructs or conditions. Invalid REs shall cause the utility or function using the RE to generate an error condition. When ``invalid'' is not used, violations of the specified syntax or semantics for REs produce undefined results: this may entail an error, enabling an extended syntax for that RE, or using the construct in error as literal characters to be matched.

2.8.3  Basic Regular Expressions

2.8.3.1  BREs Matching a Single Character or Collating Element

A BRE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket expression shall match a single character or a single collating element.

2.8.3.1.1  BRE Ordinary Characters

An ordinary character is a BRE that matches itself: any character in the supported character set, except for the BRE special characters listed in 2.8.3.1.2.

The interpretation of an ordinary character preceded by a backslash (\) is undefined, except for:

  (1)  The characters ), (, {, and }.

  (2)  The digits 1 through 9 (see 2.8.3.3).

  (3)  A character inside a bracket expression.

2.8.3.1.2  BRE Special Characters

A BRE special character has special properties in certain contexts. Outside of those contexts, or when preceded by a backslash, such a character shall be a BRE that matches the special character itself. The BRE special characters and the contexts in which they have their special meaning are:

.  [  \     The period, left-bracket, and backslash shall be special except when used in a bracket expression (see 2.8.3.2). An expression containing a [ that is not preceded by a backslash and is not part of a bracket expression produces undefined results.

*           The asterisk is special except when used

              - In a bracket expression,

              - As the first character of an entire BRE (after an initial ^, if any), or

              - As the first character of a subexpression (after an initial ^, if any); see 2.8.3.3.

^           The circumflex shall be special when used

              - As an anchor (see 2.8.3.5), or

              - As the first character of a bracket expression (see 2.8.3.2).

$           The dollar-sign shall be special when used as an anchor.

2.8.3.1.3  Periods in BREs

A period (.), when used outside of a bracket expression, is a BRE that shall match any character in the supported character set except NUL.

2.8.3.2  RE Bracket Expression

A bracket expression (an expression enclosed in square brackets, []) is an RE that matches a single collating element contained in the nonempty set of collating elements represented by the bracket expression.

The following rules and definitions apply to bracket expressions:

  (1)  A bracket expression is either a matching list expression or a nonmatching list expression. It consists of one or more expressions: collating elements, collating symbols, equivalence classes, character classes, or range expressions. Strictly Conforming POSIX.2 Applications shall not use range expressions, but conforming implementations shall support regular expressions containing range expressions. The right-bracket (]) shall lose its special meaning and represent itself in a bracket expression if it occurs first in the list [after an initial circumflex (^), if any]. Otherwise, it shall terminate the bracket expression, unless it appears in a collating symbol (such as [.].]) or is the ending right-bracket for a collating symbol, equivalence class, or character class. The special characters .
       * [ \ (period, asterisk, left-bracket, and backslash, respectively) shall lose their special meaning within a bracket expression.

       The character sequences [. [= [: (left-bracket followed by a period, equals-sign, or colon) shall be special inside a bracket expression and are used to delimit collating symbols, equivalence class expressions, and character class expressions. These symbols shall be followed by a valid expression and the matching terminating sequence .], =], or :], as described in the following items.

  (2)  A matching list expression specifies a list that shall match any one of the expressions represented in the list. The first character in the list shall not be the circumflex. For example, [abc] is an RE that matches any of a, b, or c.

  (3)  A nonmatching list expression begins with a circumflex (^), and specifies a list that shall match any character or collating element except for the expressions represented in the list after the leading circumflex. For example, [^abc] is an RE that matches any character or collating element except a, b, or c. The circumflex shall have this special meaning only when it occurs first in the list, immediately following the left-bracket.

  (4)  A collating symbol is a collating element enclosed within bracket-period ([. .]) delimiters. Collating elements are defined as described in 2.5.2.2.4. Multicharacter collating elements shall be represented as collating symbols when it is necessary to distinguish them from a list of the individual characters that make up the multicharacter collating element. For example, if the string ch is a collating element in the current collation sequence with the associated collating symbol <ch>, the expression [[.ch.]] shall be treated as an RE matching the character sequence ch, while [ch] shall be treated as an RE matching c or h. Collating symbols shall be recognized only inside bracket expressions.
       This implies that the RE [[.ch.]]*c shall match the first through fifth characters in the string chchch. If the string is not a collating element in the current collating sequence definition, or if the collating element has no characters associated with it (e.g., see the symbol in the example collation definition shown in 2.5.2.2.4), the symbol shall be treated as an invalid expression.

  (5)  An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in 2.5.2.2.4. Only primary equivalence classes shall be recognized. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ([= =]) delimiters. For example, if a, a`, and a^ belong to the same equivalence class, then [[=a=]b], [[=a`=]b], and [[=a^=]b] shall each be equivalent to [aa`a^b]. If the collating element does not belong to an equivalence class, the equivalence class expression shall be treated as a collating symbol.

  (6)  A character class expression shall represent the set of characters belonging to a character class, as defined in the LC_CTYPE category in the current locale. All character classes specified in the current locale shall be recognized. A character class expression shall be expressed as a character class name enclosed within ``bracket-colon'' ([: :]) delimiters.
Strictly conforming POSIX.2 applications shall only use the following character class expressions, which shall be supported on all conforming implementations:

     [:alnum:]   [:cntrl:]   [:lower:]   [:space:]
     [:alpha:]   [:digit:]   [:print:]   [:upper:]
     [:blank:]   [:graph:]   [:punct:]   [:xdigit:]

(7)  A range expression represents the set of collating elements that fall between two elements in the current collation sequence, inclusively. It shall be expressed as the starting point and the ending point separated by a hyphen (-).

Range expressions shall not be used in Strictly Conforming POSIX.2 Applications because their behavior is dependent on the collating sequence. Range expressions shall be supported by conforming implementations.

In the following, all examples assume the collation sequence specified for the POSIX Locale, unless another collation sequence is specifically defined.

The starting range point and the ending range point shall be a collating element or collating symbol. An equivalence class expression used as a starting or ending point of a range expression produces unspecified results. The ending range point shall collate equal to or higher than the starting range point; otherwise the expression shall be treated as invalid. The order used is the order in which the collating elements are specified in the current collation definition. One-to-many mappings (see 2.5.2.2) shall not be performed. For example, assuming that the character eszet (ß) is placed in the basic collation sequence after r and s, but before t, and that it maps to the sequence ss for collation purposes, then the expression [r-s] matches only r and s, but the expression [s-t] matches s, ß, or t.
The interpretation of range expressions where the ending range point also is the starting range point of a subsequent range expression is undefined.

The hyphen character shall be treated as itself if it occurs first (after an initial ^, if any) or last in the list, or as an ending range point in a range expression. As examples, the expressions [-ac] and [ac-] are equivalent and match any of the characters a, c, or -; the expressions [^-ac] and [^ac-] are equivalent and match any characters except a, c, or -; the expression [%--] matches any of the characters between % and -, inclusive; the expression [--@] matches any of the characters between - and @, inclusive; and the expression [a--@] is invalid, because the letter a follows the symbol - in the POSIX Locale. To use a hyphen as the starting range point, it shall either come first in the bracket expression or be specified as a collating symbol. For example: [][.-.]-0], which matches either a right bracket or any character or collating element that collates between hyphen and 0, inclusive.

2.8.3.3  BREs Matching Multiple Characters

The following rules can be used to construct BREs matching multiple characters from BREs matching a single character:

(1)  The concatenation of BREs shall match the concatenation of the strings matched by each component of the BRE.

(2)  A subexpression can be defined within a BRE by enclosing it between the character pairs \( and \). Such a subexpression shall match whatever it would have matched without the \( and \), except that anchoring within subexpressions is optional behavior; see 2.8.3.5. Subexpressions can be arbitrarily nested.

(3)  The backreference expression \n shall match the same (possibly empty) string of characters as was matched by a subexpression enclosed between \( and \) preceding the \n.
The character n shall be a digit from 1 through 9, specifying the n-th subexpression [the one that begins with the n-th \( and ends with the corresponding paired \)]. The expression is invalid if fewer than n subexpressions precede the \n. For example, the expression ^\(.*\)\1$ matches a line consisting of two adjacent appearances of the same string, and the expression \(a\)*\1 fails to match a.

(4)  When a BRE matching a single character, a subexpression, or a backreference is followed by the special character asterisk (*), together with that asterisk it shall match what zero or more consecutive occurrences of the BRE would match. For example, [ab]* and [ab][ab] are equivalent when matching the string ab.

(5)  When a BRE matching a single character, a subexpression, or a backreference is followed by an interval expression of the format \{m\}, \{m,\}, or \{m,n\}, together with that interval expression it shall match what repeated consecutive occurrences of the BRE would match. The values of m and n shall be decimal integers in the range 0 <= m <= n <= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n specifies the maximum number of occurrences. The expression \{m\} shall match exactly m occurrences of the preceding BRE, \{m,\} shall match at least m occurrences, and \{m,n\} shall match any number of occurrences between m and n, inclusive.

For example, in the string abababccccccd the BRE c\{3\} is matched by characters seven through nine, the BRE \(ab\)\{4,\} is not matched at all, and the BRE c\{1,3\}d is matched by characters ten through thirteen.

The behavior of multiple adjacent duplication symbols (* and intervals) produces undefined results.
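The interval examples in item (5) can be checked mechanically. The following is a minimal sketch, not a conforming BRE matcher. It assumes Python and its re module, which implements a Perl-style syntax rather than the BRE grammar, so a hypothetical helper (bre_to_python) first rewrites the BRE delimiters. It does not handle collating symbols, equivalence classes, character classes, a backslash inside a bracket expression, or a leading literal asterisk. Python also returns the first match found by backtracking rather than the longest one; for the examples used here the two agree. The names bre_to_python and bre_search are illustrative only.

```python
import re

# Characters that are ordinary in a BRE but special in Python's re syntax.
PYTHON_ONLY_SPECIALS = set("+?|(){}")


def bre_to_python(bre):
    """Rewrite a small subset of BRE syntax into Python re syntax."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            if nxt in "(){}":
                out.append(nxt)         # \( \) \{ \} become ( ) { }
            else:
                out.append("\\" + nxt)  # \1..\9 and quoted characters pass through
            i += 2
        elif ch in PYTHON_ONLY_SPECIALS:
            out.append("\\" + ch)       # literal in a BRE, so escape it for Python
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def bre_search(bre, text):
    """Return the 1-based, inclusive (first, last) character positions
    of the match, or None when the BRE is not matched at all."""
    m = re.search(bre_to_python(bre), text)
    if m is None:
        return None
    return (m.start() + 1, m.end())


if __name__ == "__main__":
    text = "abababccccccd"
    for bre in (r"c\{3\}", r"\(ab\)\{4,\}", r"c\{1,3\}d"):
        print(bre, bre_search(bre, text))
```

Positions are reported 1-based and inclusive so that they can be compared directly with the wording of the examples ("characters seven through nine").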
2.8.3.4  BRE Precedence

The order of precedence shall be as shown in Table 2-12, from high to low.

                    Table 2-12 - BRE Precedence
     ____________________________________________________________
     collation-related bracket symbols     [= =]  [: :]  [. .]
     escaped characters                    \<special character>
     bracket expression                    [ ]
     subexpressions/backreferences         \( \)  \n
     single-character-BRE duplication      *  \{m,n\}
     concatenation
     anchoring                             ^  $
     ____________________________________________________________

2.8.3.5  BRE Expression Anchoring

A BRE can be limited to matching strings that begin or end a line; this is called anchoring. The circumflex and dollar-sign special characters shall be considered BRE anchors in the following contexts:

(1)  A circumflex (^) shall be an anchor when used as the first character of an entire BRE. The implementation may treat circumflex as an anchor when used as the first character of a subexpression. The circumflex shall anchor the expression (or optionally subexpression) to the beginning of a string; only sequences starting at the first character of a string shall be matched by the BRE. For example, the BRE ^ab matches ab in the string abcdef, but fails to match in the string cdefab. The BRE \(^ab\) may match the former string. A portable BRE shall escape a leading circumflex in a subexpression to match a literal circumflex.
(2)  A dollar-sign ($) shall be an anchor when used as the last character of an entire BRE. The implementation may treat a dollar-sign as an anchor when used as the last character of a subexpression. The dollar-sign shall anchor the expression (or optionally subexpression) to the end of the string being matched; the dollar-sign can be said to match the ``end-of-string'' following the last character.

(3)  A BRE anchored by both ^ and $ shall match only an entire string. For example, the BRE ^abcdef$ matches strings consisting only of abcdef.

2.8.4  Extended Regular Expressions

The extended regular expression (ERE) notation and construction rules shall apply to utilities defined as using extended regular expressions; any exceptions to the following rules are noted in the descriptions of the specific utilities using EREs.

2.8.4.1  EREs Matching a Single Character or Collating Element

An ERE ordinary character, a special character preceded by a backslash, or a period shall match a single character. A bracket expression shall match a single character or a single collating element. An ERE matching a single character enclosed in parentheses shall match the same as the ERE without parentheses would have matched.

2.8.4.1.1  ERE Ordinary Characters

An ordinary character is an ERE that matches itself. An ordinary character is any character in the supported character set, except for the ERE special characters listed in 2.8.4.1.2. The interpretation of an ordinary character preceded by a backslash (\) is undefined.

2.8.4.1.2  ERE Special Characters

An ERE special character has special properties in certain contexts.
Outside of those contexts, or when preceded by a backslash, such a character shall be an ERE that matches the special character itself. The extended regular expression special characters and the contexts in which they shall have their special meaning are:

     . [ \ (     The period, left-bracket, backslash, and left-parenthesis are special except when used in a bracket expression (see 2.8.3.2).

     * + ? {     The asterisk, plus-sign, question-mark, and left-brace are special except when used in a bracket expression (see 2.8.3.2). Any of the following uses produce undefined results:

                 - If these characters appear first in an ERE, or immediately following a vertical-line, circumflex, or left-parenthesis.

                 - If a left-brace is not part of a valid interval expression.

     |           The vertical-line is special except when used in a bracket expression (see 2.8.3.2). A vertical-line appearing first or last in an ERE, or immediately following a vertical-line or a left-parenthesis, produces undefined results.

     ^           The circumflex shall be special when used

                 - As an anchor (see 2.8.4.6) or,

                 - As the first character of a bracket expression (see 2.8.3.2).

     $           The dollar-sign shall be special when used as an anchor.

2.8.4.1.3  Periods in EREs

A period (.), when used outside of a bracket expression, is an ERE that shall match any character in the supported character set except NUL.

2.8.4.2  ERE Bracket Expression

The rules for ERE Bracket Expressions are the same as for Basic Regular Expressions; see 2.8.3.2.
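Because bracket expressions follow the same rules in BREs and EREs, the twelve character class expressions that 2.8.3.2 requires can be illustrated once for both. The following is a minimal sketch in Python. It assumes the POSIX Locale over the ASCII code set; the class memberships shown are taken from that assumption and not from the current locale's LC_CTYPE category, which is what a real implementation would consult. The names POSIX_CLASSES and expand_bracket are hypothetical, and only plain characters and [: :] terms are handled (no ranges, collating symbols, equivalence classes, or nonmatching lists).

```python
import re
import string

# Assumed membership of each class in the POSIX Locale, ASCII only.
POSIX_CLASSES = {
    "alpha":  set(string.ascii_letters),
    "digit":  set(string.digits),
    "alnum":  set(string.ascii_letters + string.digits),
    "upper":  set(string.ascii_uppercase),
    "lower":  set(string.ascii_lowercase),
    "space":  set(" \t\n\v\f\r"),
    "blank":  set(" \t"),
    "punct":  set(string.punctuation),
    "xdigit": set(string.hexdigits),
    "cntrl":  {chr(c) for c in range(0x20)} | {"\x7f"},
    "graph":  {chr(c) for c in range(0x21, 0x7f)},
    "print":  {chr(c) for c in range(0x20, 0x7f)},
}


def expand_bracket(expr):
    """Return the set of characters matched by a bracket expression
    built only from plain characters and [:class:] terms."""
    if not (expr.startswith("[") and expr.endswith("]")):
        raise ValueError("not a bracket expression: " + expr)
    body = expr[1:-1]
    members = set()
    pos = 0
    for term in re.finditer(r"\[:([a-z]+):\]", body):
        members.update(body[pos:term.start()])   # plain characters before the term
        name = term.group(1)
        if name not in POSIX_CLASSES:
            raise ValueError("unknown character class: " + name)
        members.update(POSIX_CLASSES[name])
        pos = term.end()
    members.update(body[pos:])                    # plain characters after the last term
    return members


if __name__ == "__main__":
    print(sorted(expand_bracket("[[:digit:]_]")))
```

A table of explicit sets is used instead of methods such as str.isalpha, because those methods follow Unicode properties and would admit characters outside the assumed locale.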
2.8.4.3  EREs Matching Multiple Characters

The following rules shall be used to construct EREs matching multiple characters from EREs matching a single character:

(1)  A concatenation of EREs shall match the concatenation of the character sequences matched by each component of the ERE. A concatenation of EREs enclosed in parentheses shall match whatever the concatenation without the parentheses matches. For example, both the ERE cd and the ERE (cd) are matched by the third and fourth character of the string abcdefabcdef.

(2)  When an ERE matching a single character, or a concatenation of EREs enclosed in parentheses is followed by the special character plus-sign (+), together with that plus-sign it shall match what one or more consecutive occurrences of the ERE would match. For example, the ERE b+(bc) matches the fourth through seventh characters in the string acabbbcde. And, [ab]+ and [ab][ab]* are equivalent.

(3)  When an ERE matching a single character, or a concatenation of EREs enclosed in parentheses is followed by the special character asterisk (*), together with that asterisk it shall match what zero or more consecutive occurrences of the ERE would match. For example, the ERE b*c matches the first character in the string cabbbcde, and the ERE b*cd matches the third through seventh characters in the string cabbbcdebbbbbbcdbc. And, [ab]* and [ab][ab] are equivalent when matching the string ab.

(4)  When an ERE matching a single character, or a concatenation of EREs enclosed in parentheses is followed by the special character question-mark (?), together with that question-mark it shall match what zero or one consecutive occurrences of the ERE would match. For example, the ERE b?c matches the second character in the string acabbbcde.

(5)  When an ERE matching a single character, or a concatenation of EREs enclosed in parentheses is followed by an interval expression of the format {m}, {m,}, or {m,n}, together with that interval expression it shall match what repeated consecutive occurrences of the ERE would match. The values of m and n shall be decimal integers in the range 0 <= m <= n <= {RE_DUP_MAX}, where m specifies the exact or minimum number of occurrences and n specifies the maximum number of occurrences. The expression {m} shall match exactly m occurrences of the preceding ERE, {m,} shall match at least m occurrences, and {m,n} shall match any number of occurrences between m and n, inclusive.

For example, in the string abababccccccd the ERE c{3} is matched by characters seven through nine, and the ERE (ab){2,} is matched by characters one through six.

The behavior of multiple adjacent duplication symbols (+, *, ?, and intervals) produces undefined results.

2.8.4.4  ERE Alternation

Two EREs separated by the special character vertical-line (|) shall match a string that is matched by either. For example, the ERE a((bc)|d) matches the string abc and the string ad. Single characters, or expressions matching single characters, separated by the vertical bar and enclosed in parentheses, shall be treated as an ERE matching a single character.

2.8.4.5  ERE Precedence

The order of precedence shall be as shown in Table 2-13, from high to low.

                    Table 2-13 - ERE Precedence
     ____________________________________________________________
     collation-related bracket symbols     [= =]  [: :]  [. .]
     escaped characters                    \<special character>
     bracket expression                    [ ]
     grouping                              ( )
     single-character-ERE duplication      *  +  ?  {m,n}
     concatenation
     anchoring                             ^  $
     alternation                           |
     ____________________________________________________________

For example, the ERE abba|cde matches either the string abba or the string cde (because concatenation has a higher order of precedence than alternation).

2.8.4.6  ERE Expression Anchoring

An ERE can be limited to matching strings that begin or end a line; this is called anchoring. The circumflex and dollar-sign special characters shall be considered ERE anchors in the following contexts:

(1)  A circumflex (^) shall be an anchor when used anywhere outside a bracket expression. The circumflex shall anchor the (sub)expression to the beginning of a string; only sequences starting at the first character of a string shall be matched by the ERE. For example, the EREs ^ab and (^ab) match ab in the string abcdef, but fail to match in the string cdefab.

(2)  A dollar-sign ($) shall be an anchor when used anywhere outside a bracket expression. It shall anchor the expression to the end of the string being matched; the dollar-sign can be said to match the ``end-of-string'' following the last character.

(3)  An ERE anchored by both ^ and $ shall match only an entire string. For example, the EREs ^abcdef$ and (^abcdef$) match strings consisting only of abcdef.

2.8.5  Regular Expression Grammar

Grammars describing the syntax of both basic and extended regular expressions are presented in this subclause. See the grammar conventions in 2.1.2.

2.8.5.1  BRE/ERE Grammar Lexical Conventions

The lexical conventions for regular expressions shall be as described in this subclause.
Except as noted, the longest possible token or delimiter beginning at a given point shall be recognized.

The following tokens shall be processed (in addition to those string constants shown in the grammar):

     COLL_ELEM      Shall be any single-character collating element, unless it is a META_CHAR.

     BACKREF        (Applicable only to basic regular expressions.) Shall be the character string consisting of '\' followed by a single-digit numeral, 1 through 9.

     DUP_COUNT      Shall represent a numeric constant. It shall be an integer in the range 0 <= DUP_COUNT <= {RE_DUP_MAX}. This token shall only be recognized when the context of the grammar requires it. At all other times, digits not preceded by '\' shall be treated as ORD_CHAR.

     META_CHAR      Shall be one of the characters:

                    ^    When found first in a bracket expression

                    -    When found anywhere but first (after an initial ^, if any) or last in a bracket expression, or as the ending range point in a range expression

                    ]    When found anywhere but first (after an initial ^, if any) in a bracket expression.

     L_ANCHOR       (Applicable only to basic regular expressions.) Shall be the character ^ when it appears as the first character of a basic regular expression and when not QUOTED_CHAR. The ^ may be recognized as an anchor elsewhere; see 2.8.3.5.

     ORD_CHAR       Shall be a character, other than one of the special characters in SPEC_CHAR.

     QUOTED_CHAR    Shall be one of the character sequences:

                    \^    \.    \*    \[    \$    \\

     R_ANCHOR       (Applicable only to basic regular expressions.) Shall be the character $ when it appears as the last character of a basic regular expression and when not QUOTED_CHAR. The $ may be recognized as an anchor elsewhere; see 2.8.3.5.

     SPEC_CHAR      For basic regular expressions, shall be one of the following special characters:

                    .    Anywhere outside bracket expressions

                    \    Anywhere outside bracket expressions

                    [    Anywhere outside bracket expressions

                    ^    When an anchor; see 2.8.3.5

                    $    When an anchor; see 2.8.3.5

                    *    Anywhere except: first in an entire RE; anywhere in a bracket expression; directly following \(; directly following an anchoring ^.

                    For extended regular expressions, shall be one of the following special characters found anywhere outside bracket expressions:

                    ^    .    [    $    (    )    |    *    +    ?    {    \

                    (The close-parenthesis shall be considered special in this context only if matched with a preceding open-parenthesis.)

2.8.5.2  RE and Bracket Expression Grammar

This subclause presents the grammar for basic regular expressions, including the bracket expression grammar that is common to both BREs and EREs.

     %token    ORD_CHAR QUOTED_CHAR SPEC_CHAR DUP_COUNT
     %token    BACKREF L_ANCHOR R_ANCHOR
     %token    Back_open_paren  Back_close_paren
     /*          '\('             '\)'        */
     %token    Back_open_brace  Back_close_brace
     /*          '\{'             '\}'        */

     /* The following tokens are for the Bracket Expression
        grammar common to both REs and EREs. */

     %token    COLL_ELEM META_CHAR
     %token    Open_equal Equal_close Open_dot Dot_close Open_colon Colon_close
     /*           '[='       '=]'        '[.'     '.]'      '[:'       ':]'  */

     %token    class_name
     /* class_name is a keyword to the LC_CTYPE locale category */
     /* (representing a character class) in the current locale  */
     /* and is only recognized between [: and :]                */

     %start    basic_reg_exp
     %%

     /* --------------------------------------------
        Basic Regular Expression
        --------------------------------------------
     */
     basic_reg_exp    :          RE_expression
                      | L_ANCHOR
                      |                        R_ANCHOR
                      | L_ANCHOR               R_ANCHOR
                      | L_ANCHOR RE_expression
                      |          RE_expression R_ANCHOR
                      | L_ANCHOR RE_expression R_ANCHOR
                      ;
     RE_expression    :               simple_RE
                      | RE_expression simple_RE
                      ;
     simple_RE        : nondupl_RE
                      | nondupl_RE RE_dupl_symbol
                      ;
     nondupl_RE       : one_character_RE
                      | Back_open_paren RE_expression Back_close_paren
                      | Back_open_paren Back_close_paren
                      | BACKREF
                      ;
     /*
        Note: This grammar does not permit L_ANCHOR or
        R_ANCHOR inside \( and \) (which implies that ^ and $
        are ordinary characters). This reflects the semantic
        limits on the application, as noted in 2.8.3.5.
        Implementations are permitted to extend the language to
        interpret ^ and $ as anchors in these locations, and as
        such portable applications shall not use unescaped ^
        and $ in positions inside \( and \) that might be
        interpreted as anchors.
     */
     one_character_RE : ORD_CHAR
                      | QUOTED_CHAR
                      | '.'
                      | bracket_expression
                      ;
     RE_dupl_symbol   : '*'
                      | Back_open_brace DUP_COUNT Back_close_brace
                      | Back_open_brace DUP_COUNT ',' Back_close_brace
                      | Back_open_brace DUP_COUNT ',' DUP_COUNT Back_close_brace
                      ;

     /* --------------------------------------------
        Bracket Expression
        -------------------------------------------
     */
     bracket_expression : '[' matching_list ']'
                      | '[' nonmatching_list ']'
                      ;
     matching_list    : bracket_list
                      ;
     nonmatching_list : '^' bracket_list
                      ;
     bracket_list     : follow_list
                      | follow_list '-'
                      ;
     follow_list      :             expression_term
                      | follow_list expression_term
                      ;
     expression_term  : single_expression
                      | range_expression
                      ;
     single_expression : end_range
                      | character_class
                      ;
     range_expression : start_range end_range
                      | start_range '-'
                      ;
     start_range      : end_range '-'
                      ;
     end_range        : COLL_ELEM
                      | collating_symbol
                      ;
     collating_symbol : Open_dot COLL_ELEM Dot_close
                      | Open_dot META_CHAR Dot_close
                      ;
     equivalence_class : Open_equal COLL_ELEM Equal_close
                      ;
     character_class  : Open_colon class_name Colon_close
                      ;

2.8.5.3  ERE Grammar

This subclause presents the grammar for extended regular expressions, excluding the bracket expression grammar.

NOTE: The bracket expression grammar and the associated %token lines are identical between BREs and EREs. It has been omitted from the ERE subclause to avoid unnecessary editorial duplication.

     %token    ORD_CHAR QUOTED_CHAR SPEC_CHAR DUP_COUNT

     %start    extended_reg_exp
     %%

     /* --------------------------------------------
        Extended Regular Expression
        --------------------------------------------
     */
     extended_reg_exp :                      anchored_ERE
                      |                      nonanchored_ERE
                      | extended_reg_exp '|' nonanchored_ERE
                      | extended_reg_exp '|' anchored_ERE
                      ;
     anchored_ERE     : '^' nonanchored_ERE
                      | '^' nonanchored_ERE '$'
                      |     nonanchored_ERE '$'
                      | '^'
                      | '$'
                      | '^' '$'
                      ;
     nonanchored_ERE  :                 ERE_expression
                      | nonanchored_ERE ERE_expression
                      ;
     ERE_expression   : one_character_ERE
                      | '(' extended_reg_exp ')'
                      | ERE_expression ERE_dupl_symbol
                      ;
     one_character_ERE : ORD_CHAR
                      | '\' SPEC_CHAR
                      | '.'
                      | bracket_expression
                      ;
     ERE_dupl_symbol  : '*'
                      | '+'
                      | '?'
                      | '{' DUP_COUNT '}'
                      | '{' DUP_COUNT ',' '}'
                      | '{' DUP_COUNT ',' DUP_COUNT '}'
                      ;

BEGIN_RATIONALE

2.8.6  Regular Expression Notation Rationale. (This subclause is not a part of P1003.2)

Editor's Note: Some of the text and headings of this rationale have been rearranged. Moved text has not been diffmarked unless it changed.

Rather than repeating the description of regular expressions for each utility supporting REs, the working group preferred a common, comprehensive description of regular expressions in one place. The most common behavior is described here, and exceptions or extensions to this are documented for the respective utilities, if appropriate.

The Basic Regular Expression corresponds to the ed or historical grep type, and the Extended Regular Expression corresponds to the historical egrep type (now grep -E).

The text is based on the ed description and substantially modified, primarily to aid developers and others in the understanding of the capabilities and limitations of regular expressions. Much of this was influenced by the internationalization requirements.
It should be noted that the definitions in this clause do not cover the tr utility (see 4.64); the tr syntax does not employ regular expressions.

The specification of regular expressions is particularly important to internationalization, because pattern matching operations are very basic operations in business and other operations. The syntax and rules of regular expressions are intended to be as intuitive as possible, to make them easy to understand and use. The historical rules and behavior do not provide that capability to non-English-language users, and do not provide the necessary support for commonly used characters and language constructs. It was necessary to provide extensions to the historical regular expression syntax and rules, to accommodate other languages. Such modifications were proposed by the UniForum Technical Committee Subcommittee on Internationalization and accepted by the working group. As they are limited to bracket expressions, the rationale for these modifications can be found in 2.8.6.3.2.

2.8.6.1  Regular Expression Definitions Rationale. (This subclause is not a part of P1003.2)

The definition of which sequence is matched when several are possible is based on the leftmost-longest rule historically used by deterministic recognizers. This rule is much easier to define and describe, and arguably more useful, than the first-match rule historically used by nondeterministic recognizers. It is thought that dependencies on the choice of rule are rare; carefully contrived examples are needed to demonstrate the difference.

A formal expression of the leftmost-longest rule is:

     The search is performed as if all possible suffixes of the string were tested for a prefix matching the pattern; the longest suffix containing a matching prefix is chosen, and the longest possible matching prefix of the chosen suffix is identified as the matching sequence.
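The formal expression above translates almost word for word into a brute-force search. The following is a minimal sketch in Python, assuming patterns without anchors (Python's endpos argument would let $ match at the end of every candidate prefix). Python's re module is itself a first-match, backtracking recognizer, so it is used here only to answer the yes-or-no question "does the pattern match exactly this substring"; the function name leftmost_longest is hypothetical.

```python
import re


def leftmost_longest(pattern, text):
    """Return (start, end) offsets of the leftmost-longest match, or None.

    Suffixes are tried from longest to shortest (start = 0, 1, 2, ...);
    within the first suffix that has any matching prefix, prefixes are
    tried from longest to shortest.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):               # longest suffix first
        for end in range(len(text), start - 1, -1):  # longest prefix first
            if compiled.fullmatch(text, start, end):
                return (start, end)
    return None


if __name__ == "__main__":
    # A contrived case where the two rules differ.
    print(leftmost_longest("a|ab", "xab"))   # leftmost-longest
    print(re.search("a|ab", "xab").span())   # first-match (backtracking)
```

The sketch performs a quadratic number of match attempts and is meant only to make the rule concrete; it says nothing about how an implementation should be structured.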
It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each subexpression, but only with the proviso that the overall match is leftmost longest (see 2.8.1.2). For example, matching \(ac*\)c*d[ac]*\1 against acdacaaa should match acdacaaa (with \1=a); simply matching the longest match for \(ac*\) would yield \1=ac, but the overall match would be smaller (acdac). In principle, the implementation must examine every possible match and among those that yield the leftmost longest total matches, pick the one that does the longest match for the leftmost subexpression and so on. Note that this means that matching by subexpressions is context dependent: a subexpression within a larger RE may match a different string from the one it would match as an independent RE, and two instances of the same subexpression within the same larger RE may match different lengths even in similar sequences of characters. For example, in the ERE (a.*b)(a.*b), the two identical subexpressions would match four and six characters, respectively, of accbaccccb. Thus, it is not possible to hierarchically decompose the matching problem into smaller, independent, matching problems.

Matching is based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one code set, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
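The context dependence of subexpression matching described earlier in this subclause can be observed directly with the ERE (a.*b)(a.*b). The following is a small sketch in Python; it assumes that Python's re module, although not a POSIX matcher, is adequate here because this pattern can match accbaccccb in only one way, so first-match and leftmost-longest semantics cannot disagree.

```python
import re

TEXT = "accbaccccb"

# The same subexpression, a.*b, used twice inside one larger ERE.
m = re.search(r"(a.*b)(a.*b)", TEXT)
first, second = m.group(1), m.group(2)

# The subexpression on its own, as an independent RE.
alone = re.search(r"a.*b", TEXT).group(0)

print(len(first), len(second), len(alone))   # prints: 4 6 10
```

The two instances inside the larger ERE match four and six characters, while the same subexpression taken as an independent RE matches all ten.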
The definition of ``single character'' has been expanded to include also collating elements consisting of two or more characters; this expansion is applicable only when a bracket expression is included in the BRE or ERE. An example of such a collating element may be the Dutch ``ij'', which collates as a ``y.'' In some encodings, a ligature ``i with j'' exists as a character, and would represent a single-character collating element. In another encoding, no such ligature exists, and the two-character sequence ``ij'' is defined as a multicharacter collating element. Outside brackets, the ``ij'' is treated as a two-character RE and will match the same characters in a string. Historically, a bracket expression only matched a single character. If, however, the bracket expression defines, for example, a range that includes ``ij'', then this particular bracket expression will also match a sequence of the two characters ``i'' and ``j'' in the string.

2.8.6.2  Regular Expression General Requirements Rationale. (This subclause is not a part of P1003.2)

Historically, most regular expression implementations only match lines, not strings. However, that is more an effect of the usage than of an inherent feature of regular expressions themselves. Consequently, POSIX.2 does not regard <newline>s as special; they are ordinary characters, and both a period and a nonmatching list can match them. Those utilities (like grep) that do not allow <newline>s to match are responsible for eliminating any <newline> from strings before matching against the RE. The regcomp() function, however, can provide support for such processing without violating the rules of this clause.
The definition of case-insensitive processing is intended to allow matching of multicharacter collating elements as well as characters. For instance, as each character in the string is matched using both its cases, the RE [[.Ch.]], when matched against char, is in reality matched against ch, Ch, cH, and CH.

Some implementations of egrep have had very limited flexibility in handling complex extended regular expressions. POSIX.2 does not attempt to define the complexity of a BRE or ERE, but does place a lower limit on it--any regular expression must be handled, as long as it can be expressed in 256 bytes or less. (Of course, this does not place an upper limit on the implementation.) There are existing programs using a nondeterministic-recognizer implementation that should have no difficulty with this limit. It is possible that a good approach would be to attempt to use the faster, but more limited, deterministic recognizer for simple expressions and to fall back on the nondeterministic recognizer for those expressions requiring it. Nondeterministic implementations must be careful to observe the 2.8.1.2 rules on which match is chosen; the longest match, not the first match, starting at a given character is used.

The term ``invalid'' highlights a difference between this clause and some others: POSIX.2 frequently avoids mandating of errors for syntax violations because they can be used by implementors to trigger extensions. However, the authors of the internationalization features of regular expressions desired to mandate errors for certain conditions to identify usage problems or nonportable constructs. These are identified within this rationale as appropriate. The remaining syntax violations have been left implicitly or explicitly undefined. For example, the BRE construct \{1,2,3\} does not comply with the grammar.
A conforming application cannot rely on it either producing an error or matching the literal characters \{1,2,3\}. The term ``undefined'' was used in favor of ``unspecified'' because many of the situations are considered errors on some implementations and it was felt that consistency throughout the clause was preferable to mixing undefined and unspecified.

2.8.6.3 Basic Regular Expressions Rationale. (This subclause is not a part of P1003.2)

2.8.6.3.1 BREs Matching a Single Character or Collating Element Rationale. (This subclause is not a part of P1003.2)

2.8.6.3.2 RE Bracket Expression Rationale. (This subclause is not a part of P1003.2)

If a bracket expression must specify both - and ], then the ] must be placed first (after the ^, if any) and the - last within the bracket expression.

Range expressions are, historically, an integral part of regular expressions. However, the requirements of ``natural language behavior'' and portability conflict: ranges must be treated according to the current collating sequence, and include such characters that fall within the range based on that collating sequence, regardless of character values. This, however, means that the interpretation will differ depending on collating sequence. If, for instance, one collating sequence defines ``ä'' as a variant of ``a'', while another defines it as a letter following ``z'', then the expression [ä-z] is valid in the first language and invalid in the second. This kind of ambiguity should be avoided in portable applications, and therefore the working group elected to state that ranges must not be used in strictly conforming applications; however, implementations must support them.
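The placement rule for ] and - given at the start of this subclause can be exercised with a short C sketch. It assumes the regcomp()/regexec() interface of the published standard; the helper name bracket_match is hypothetical. Ranges are deliberately avoided, since their meaning depends on the current collating sequence.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compiles `bre` as a basic regular expression and
 * returns 1 if it matches somewhere in `text`, 0 if not, -1 on a
 * compilation error.
 */
int bracket_match(const char *bre, const char *text)
{
    regex_t re;
    int rc;

    assert(bre != NULL && text != NULL);

    if (regcomp(&re, bre, REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With ] placed first and - placed last, the BRE ^[]a-]$ is expected to match each of the one-character strings ], a, and -, and nothing else. The nonmatching form ^[^]a-]$ keeps the same ordering after the circumflex.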
Some historical implementations allow range expressions where the ending range point of one range is also the starting point of the next (for instance [a-m-o]). This behavior should not be permitted, but to avoid breaking existing implementations, it is now undefined whether it is a valid expression, and how it should be interpreted.

Current practice in awk and lex is to accept escape sequences in bracket expressions as per Table 2-15, while the normal regular expression behavior is to regard such a sequence as consisting of two characters. Allowing the awk/lex behavior in regular expressions would change the normal behavior in an unacceptable way; it is expected that awk and lex will decode escape sequences in regular expressions before passing them to regcomp() or comparable routines. Each utility describes the escape sequences it accepts as an exception to the rules in this clause; the list is not the same, for historical reasons.

As noted earlier, the new syntax and rules have been added to accommodate languages other than English. These modifications were proposed by the UniForum Subcommittee on Internationalization and accepted by the working group. The remainder of this clause describes the rationale for these modifications.

Internationalization Requirements

The goal of the internationalization effort was to provide functions and capabilities that matched the capabilities of existing implementations, but that adhered to the user's local customs, rules, and environment. This has also been described as ``removing the ASCII (and English language) bias.'' In addition, other requirements also influence the standardization efforts, such as portability, extensibility, and compatibility.
In a worldwide environment portability carries much weight. Wherever feasible, users should be given the capability to develop code that can execute independently of character set, code set, or language.

Standards must also be extensible; to support further development, to allow for local or regional extensions, or to accommodate new concepts (such as multibyte characters).

Compatibility does not only refer to support of existing code, but also to making the new syntax, semantics, and functions compatible with existing environments and implementations.

Internationalization Technical Background

The C Standard {7} (and, by implication, also POSIX) recognizes that the ASCII character set used in historical UNIX system implementations is not adequate outside the Anglo-American language area. It is, however, not enough to remove the ASCII bias; the dependency on Anglo-Saxon conventions and rules must also be broadened to accommodate other cultures, including those that require thousands of characters.

Character sets are defined by their attributes; typical attributes are the encoding, the collating sequence, the character classification, and the case mapping. It is also recognized that, even within one language area, several combinations of attributes exist: character set attributes are mutable and combinatory. So, rather than replacing one straitjacket by another, the proposed standards make character sets user-definable and program-selectable.

The existence of character set attributes is implicit in regular expressions (REs). This implies that regular expressions must recognize and adapt to the program-selected set of attributes. A program selects the appropriate character set (or combination of attributes) using the mechanism described in 2.5.
The definition of a character set (its attributes) is external to an executing program. Many combinations of attributes can exist concurrently. Of particular interest are the following attributes:

(1) Collating Sequence. In existing implementations, the encoded ASCII ordering matches the logical English collating sequence. This correspondence does not exist for all code sets or languages. In addition, many languages employ concepts that have no counterparts in English collation:

(a) In many languages, ordering is based on the concept of string collation rather than character collation as in English. One of the effects of this is that the ordering is based on collating elements rather than on characters. Characters typically map into collating elements:

- One-to-one mapping, where a character is also a collating element,

- One-to-N mapping, where a single character maps into two or more collating elements (as the German ``ß'' (eszet), which collates as ``ss''),

- N-to-one mapping, where two or more characters map into one collating element (as in the Spanish ``ll'', which collates between ``l'' and ``m''; i.e., a word beginning with ``ll'' collates after a word beginning with ``lo'').

(b) A common method for adding characters to an alphabet is to use diacritical marks, such as accents or circumflex (^). In some languages, this creates a completely new character, collated differently from the Latin ``base.'' In other languages these accented characters are collated as variants of the Latin base letter; i.e., they have the same relative order; they are equivalent.
If the strings (words) being compared are equal except for ``accents,'' the strings can be ordered based on a secondary ordering within the ``equivalence class.'' For instance, in French, the words ``tache'', ``tâche'', and ``tacheter'' collate in that order.

The C Standard {7} recognizes this; it includes new library functions capable of handling complex collation rules. These functions depend on the setting of the setlocale() category LC_COLLATE for a definition of the current collation rules.

(2) Character Classification. Character classification and case mapping is another area where each language (or even language area) has its own rules. Although users in different countries can use the same code set, such as ISO 8859-1 {5}, the definition of what constitutes a letter or an uppercase letter may vary.

The C Standard {7} recognizes this; library functions used to classify characters or perform case mapping depend on the setlocale() category LC_CTYPE for a definition of how characters map to character classes.

Internationalization Proposal Areas

Based on the requirements and attribute characteristics defined above, and after reviewing proposals and definitions by X/Open and other organizations, the UniForum Subcommittee on Internationalization decided to concentrate on the following areas: the range expression, character classes, the definition of one-character RE (multicharacter element), and equivalence classes. Most of these are heavily dependent on the current definition of collation sequence; the Subcommittee felt it natural to couple the capabilities and interpretation of bracket expressions closely to the requirements for extended collation capabilities.
In addition, the Subcommittee felt that the capabilities described in 2.5 formed a suitable basis for runtime control of regular expression behavior.

The Subcommittee realized that the mechanism selected requires changes in the existing syntax. As a rule, the Subcommittee wished to minimize changes and avoid syntactical changes that may cause existing regular expressions to fail.

(1) Collating Elements and Symbols. As noted above, many expressions within a bracket expression are closely connected with collation, and the Subcommittee defined many capabilities in terms of collating elements and collating symbols. A collating element is defined as a sequence of one or more bytes defined in the current collating sequence definition as a unit of collation. In most cases, a collating element is equal to a character, but the collation sequence may exclude some characters, or define two or more characters as a collating element.

A one-character RE is, logically enough, defined as one character or something that translates into one character (the number of bits used to represent the character is not an issue here). The expression within square brackets is a one-character RE; i.e., single characters are matched against the list of single characters defined within the brackets.

In Spanish, the phrase ``a to d'' means the sequence of collating elements a, á, b, c, ch, and d. Consequently, with a Spanish character set, the range statement [a-d] includes the ch collating element, even though it is expressed with two characters (N-to-1 mapping). The historical syntax, however, does not allow the user to define either the range from a through ch, or to define ch as a single character rather than as either c or h.
The Subcommittee decided that N-to-1 mappings be recognized (if properly delimited), as one-character REs inside, but not outside, square brackets (e.g., a period will never match ch). To be distinguishable from a list of the characters themselves, the multicharacter element must be delimited from the remainder of the characters in the string. The characters [. and .] are used to delimit a multicharacter collating element from other elements, and can be used to delimit single-character collating elements.

(2) Equivalence Classes. As stated previously, many languages extend the Latin alphabet by using diacritical marks. In some cases, the Latin base character (e.g., a) and the accented versions of the base (e.g., à, â in French) constitute a ``subclass'' of characters with some partially equivalent characteristics but different code values. Because these characters are related, they are often processed as a group. The historical syntax, however, does not provide for this in a portable manner.

Although it represents an extension of the historical capabilities, the X/Open group strongly recommended that a properly delimited collating element be recognized as representing an equivalence class, that is as the collating element itself, and all other characters with the same primary order in the collation sequence. The Subcommittee supported this recommendation, and also selected [= and =] as delimiters for equivalence classes.

(3) Range Expressions. The hyphen historically indicated ``a range of consecutive ASCII characters;'' typically it stands for the word ``to,'' as in ``a to z,'' and implies an ordered interval. In ASCII, the encoded order matches the logical English order; this is not true with other encodings or with other alphabets.
If the ASCII dependency is removed, an alternative could have been to use the encoded sequence of whatever code set is currently used. This, however, would certainly decrease portability, as well as requiring the user to know the ordering of the current code set. It would also most certainly be counter-intuitive; a French user would expect the expression [a-d] to match any of the letters a, à, â, b, c, ç, or d. The Subcommittee regards this interpretation of ranges as most compatible with existing capabilities, and one that provides for the desired portability.

As the logical ordering need not be inherent in the encoded sequence, an external definition was required. Such a definition was already present via the collating sequence attribute of the character set. The setlocale() function provides for an LC_COLLATE category, which defines the current collating sequence. The Subcommittee selected this as the basis for the interpretation of ranges, as well as of equivalence classes and multicharacter collating symbols.

(4) Character Classes. The range expression is commonly used to indicate a character class; the ex(au_cmd) section of the SVID states: ``...
a pair of characters separated by - defines a range (e.g., a-z defines any lowercase letter)....''

In reality, [a-z] means ``any lowercase letter between a and z, inclusive.'' This is only equivalent to ``any lowercase letter'' if a is the first and z is the last lowercase letter in the collating sequence.

To provide the intended capabilities in a portable way, the Subcommittee introduced a new syntactical element, namely an explicit character class. The definition of which characters constitute a specific character class is already present via the LC_CTYPE category of the setlocale() function. The Subcommittee selected the identification of character classes by name, bracketed by [: and :]. A character class cannot be used as an endpoint in a range statement.

Internationalization Syntax

The Subcommittee was careful to propose changes in the regular expression syntax that minimize the impact on existing REs. In evaluating alternatives, the Subcommittee looked at ease of use (terseness, ease of remembering, keyboard availability), impact on historical REs (compatibility), implementability, performance, and how error-prone the syntax is likely to be (ambiguity). The Subcommittee made the following evaluation:

(1) Syntax changes must be limited to expressions within square brackets.

(2) Strings or characters with special meaning must be delimited from ordinary strings, to avoid compatibility problems.

(3) Both initial and terminating delimiter should consist of two characters, to minimize compatibility and ambiguity problems.

(4) Outer delimiter character should be bracketing; i.e., naturally indicate initial and terminating side. Examples: {} <> ().
(5) The brackets ([]) are, due to the special rules for ``brackets within brackets,'' rather unlikely to be used in the intended way (a closing bracket must precede an open bracket in the existing syntax).

(6) To minimize ambiguity, brackets must be paired with another character. Many other symbols are already in use, either within regular expressions, or in the shell. Examples of usable characters are: = . :

(7) Because a multicharacter collating element also can be a member of an equivalence class, different delimiters must be chosen for these two expressions. Also, the character class expression must be distinguishable from, e.g., multicharacter collating symbols; although no historical example is known to the Subcommittee, prudence dictated that character classes be given separate delimiters.

(8) The Subcommittee selected the period as the secondary delimiter for multicharacter collating symbols.

(9) The Subcommittee selected the equals-sign as the secondary delimiter for equivalence classes.

(10) The Subcommittee selected the colon as the secondary delimiter for character classes.

The specific syntax and facilities described in this clause represent a coalescence of proposals and implementations from several vendors. Due to differences in facilities and syntax, it was not possible to take one implementation and codify it. There are now several implementations closely patterned on the existing proposal.

The facilities presented in this clause are described in a manner that does not preclude their use with multibyte character sets. However, no attempt has been made to include facilities specifically intended for such character sets.

The definition of character classes is tied to the LC_CTYPE definition. The set of character classes defined in the C Standard {7} represents the
minimum set of character classes required worldwide, i.e., those required by all implementations. It is the working group's belief that local standards bodies, as well as individual vendors, will provide extensions to the standard in these areas, for instance, Kanji character classes.

In many historical implementations, an invalid range is treated as if it consisted of the endpoints only. For example, [z-a] is treated as [za]. Some implementations treat the above range as [z], and others as [-az]. Neither is correct, and the working group decided that this should be treated as an error.

It was proposed that the syntax for bracket expressions be simplified such that the ``extra'' brackets are not needed if the bracket expression only consists of a character class, an equivalence class, or a collating symbol: ``[:alpha:]'' instead of ``[[:alpha:]]''. To ensure unambiguity, if a bracket expression starts with :, =, or ., then it cannot contain a class expression or a collating symbol (or duplicated characters). In addition, it was also proposed that only valid class or collating symbol expressions be accepted: e.g., [[:ctrl:]] is an invalid expression.

The working group rejected the proposal. While the syntax [:alpha:] may be intuitive to some, the proposal does not allow, e.g., [:digit:.ch.]. The alternative, to require additional brackets for the latter case, would probably cause more errors than the historical syntax. Requiring erroneous class expressions or collating symbols to make the regular expression invalid may minimize the risks for inadvertent spelling errors. However, at this point it was judged that this would reduce consensus.

Consideration was given to eliminating the [.ch.] syntax and providing that collating elements should be recognized as such both inside and outside bracket expressions.
In addition, consideration was given to defining character classes such that collating elements are included. The working group rejected these proposals. The [.ch.] syntax is only required inside bracket expressions due to the fact that a bracket expression historically only matched a single character. If ch is a collating element, a range [a-z] (if ``ch'' falls within it) matches ch. Outside brackets, an expression ch is treated as two concatenated characters, matching the string ``ch''. The [.ch.] expression is intended to allow the specification of a multicharacter collating element separately from ranges in a bracket expression. Character classes are not intended to include collating elements; there is no requirement that all characters in a multicharacter collating element belong to the same character class (for instance ``Ch'' is ``alpha'' but neither ``upper'' nor ``lower''). Introducing collating elements in character classes would be nonintuitive.

It was suggested that, because ranges may or may not be meaningful (or even accepted) based on the current collating sequence, they should be eliminated from the syntax (or at least marked obsolescent). It was suggested that, e.g., [z-a] should always be or never be an error, regardless of collating sequence.

The working group did not wish to eliminate ranges from the syntax. While it is true that ranges may not be universally portable, they are nevertheless a useful and fundamental construct in regular expressions. The regular expression syntax has consciously been extended to provide both increased portability and extended local capabilities. Where supported, ranges must reflect the current collating sequence.
The working group instead elected to include range expressions as an implementation requirement, but state that strictly conforming applications (but not, e.g., National-Body-conforming applications) shall not use range expressions. Treating erroneous ranges as invalid points out that these may not be portable across collating sequences, and is better than (silently) making them behave in a way contrary to the intent of the user.

Earlier drafts allowed the use of an equivalence class expression as the starting or ending point of a range expression, such as [[=e=]-f]. This now produces unspecified results because it is possible to define the equivalence class as a disjoint set of characters. This example could produce different results on various systems:

- An error.

- The equivalent of [[=e=]e-f] (which is the correct portable way to include equivalence class effects in a bracket expression).

- All of the collating elements from the lowest value found in the equivalence class, including any of the elements found between the disjoint values.

Consideration was given to saying that equivalence classes with disjoint elements produce unspecified results at the start or end of a range, but since the application cannot predict which equivalence classes are disjoint, this is no improvement over the more general statement chosen.

It was suggested that, while reference to nonprintable characters is partially supported by the proposed set of character classes, the specificity is not precise enough, and that additional character classes should be supported, e.g., [:tab:] or [:a:]. The working group rejected this proposal, because this feature would represent a substantial enhancement to the current regular expression syntax, and one that cannot be based on internationalization requirements. It is judged that its inclusion would reduce consensus.
A future revision of regular expressions should study the capability to create temporary character classes for use in regular expressions; a ``character class macro facility.''

2.8.6.3.3 BREs Matching Multiple Characters Rationale. (This subclause is not a part of P1003.2)

The limit of nine backreferences to subexpressions in the RE is based on the use of a single digit identifier; increasing this to multiple digits would break historical applications. This does not imply that only nine subexpressions are allowed in REs. The following is a valid BRE with ten subexpressions:

    \(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*

The working group regards the common current behavior, which supports \n*, but not \n\{min,max\}, or \(...\)*, or \(...\)\{min,max\}, as a nonintentional result of a specific implementation, and supports both duplication and interval expressions following subexpressions and backreferences.

2.8.6.3.4 Expression Anchoring Rationale. (This subclause is not a part of P1003.2)

Often, the dollar-sign is viewed as matching the ending <newline> in text files. This is not strictly true; the <newline> is typically eliminated from the strings to be matched and the dollar-sign matches the terminating null character.

The ability of ^, $, and * to be nonspecial in certain circumstances may be confusing to some programmers, but this situation was changed only in a minor way from historical practice to avoid breaking many existing scripts. Some consideration was given to making the use of the anchoring characters undefined if not escaped and not at the beginning or end of strings.
This would cause a number of historical BREs, such as 2^10, $HOME, and $1.35, which relied on the characters being treated literally, to become invalid.

However, one relatively uncommon case was changed to allow an extension used on some implementations. Historically, the BREs ^foo and \(^foo\) did not match the same string, despite the general rule that subexpressions and entire BREs match the same strings. To achieve balloting consensus, POSIX.2 has allowed an extension on some systems to treat these two cases in the same way by declaring that anchoring may occur at the beginning or end of a subexpression. Therefore, portable BREs that require a literal circumflex at the beginning or a dollar-sign at the end of a subexpression must escape them. Note that a BRE such as a\(^bc\) will either match a^bc or nothing on different systems under the POSIX.2 rules.

ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped anchor character has never matched its literal counterpart outside of a bracket expression. Some systems treated foo$bar as a valid expression that never matched anything, others treated it as invalid. POSIX.2 mandates the former, valid unmatched behavior.

Some systems have extended the BRE syntax to add alternation. For example, the subexpression \(foo$\|bar\) would match either foo at the end of the string or bar anywhere. The extension is triggered by the use of the undefined \| sequence. Because the BRE is undefined for portable scripts, the extending system is free to make other assumptions, such as that the $ represents the end-of-line anchor in the middle of a subexpression. If it were not for the extension, the $ would match a literal dollar-sign under the POSIX.2 rules.
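The context-dependent meaning of ^ and $ in BREs can be observed with a minimal C sketch, assuming the regcomp()/regexec() interface of the published standard. The helper name bre_search is hypothetical. The subexpression case a\(^bc\) is deliberately left out, because the text above notes that its result differs between systems.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compiles `bre` as a basic regular expression
 * (no REG_EXTENDED) and returns 1 if it matches somewhere in `text`,
 * 0 if not, -1 on a compilation error.
 */
int bre_search(const char *bre, const char *text)
{
    regex_t re;
    int rc;

    assert(bre != NULL && text != NULL);

    if (regcomp(&re, bre, REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Under these assumptions, the historical BREs 2^10 and $HOME are expected to match their literal text, because the circumflex is not at the beginning and the dollar-sign is not at the end of the expression. In contrast, ^foo matches only at the start of the string, and \^foo matches a literal circumflex followed by foo.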
2.8.6.4 Extended Regular Expressions Rationale. (This subclause is not a part of P1003.2)

As with basic regular expressions, the working group decided to make the interpretation of escaped ordinary characters undefined.

The right-parenthesis is not listed as an ERE special character because it is only special in the context of a preceding left-parenthesis. If found without a preceding left-parenthesis, the right-parenthesis has no special meaning.

Based on objections in several ballots, the interval expression, {m,n}, has been added to extended regular expressions. Historically, the interval expression has only been supported in some extended regular expression implementations. The working group estimated that the addition of interval expressions to extended regular expressions would not decrease consensus, and would also make basic regular expressions more of a subset of extended regular expressions than in many historical implementations.

It was suggested that, in addition to interval expressions, backreferences (\n) also should be added to extended regular expressions. This was rejected by the working group as likely to decrease consensus.

In historical implementations, multiple duplication symbols are usually interpreted from left to right and treated as additive. As an example, a+*b matches zero or more instances of a followed by a b. In POSIX.2, multiple duplication symbols are undefined; i.e., they cannot be relied upon for portable applications. One reason for this is to provide some scope for future enhancements; the current syntax is very crowded.

The precedence of operations differs between EREs and those in lex; in lex, for historical reasons, interval expressions have a lower precedence than concatenation.
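The interval expression {m,n} in EREs, and its escaped BRE counterpart \{m,n\}, can be compared with a short C sketch. It assumes the regcomp()/regexec() interface and the REG_EXTENDED flag of the published standard; the helper name interval_match and its third parameter are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compiles `pattern` as an ERE when `extended` is
 * nonzero and as a BRE otherwise. Returns 1 on a match in `text`,
 * 0 on no match, -1 on a compilation error.
 */
int interval_match(const char *pattern, const char *text, int extended)
{
    regex_t re;
    int rc;
    int cflags = REG_NOSUB | (extended ? REG_EXTENDED : 0);

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With these assumptions, the ERE ^a{2,3}$ is expected to match aa and aaa but neither a nor aaaa, and the BRE ^a\{2,3\}$ accepts the same strings. Multiple duplication symbols such as a+* are not exercised here, since the text above leaves them undefined.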
2.8.6.5 Regular Expression Grammar Rationale. (This subclause is not a part of P1003.2)

None.

END_RATIONALE

2.9 Dependencies on Other Standards

2.9.1 Features Inherited from POSIX.1

This subclause describes some of the features provided by POSIX.1 {8} that are assumed to be globally available by all systems conforming to POSIX.2. This subclause does not attempt to detail all of the POSIX.1 {8} features that are required by all of the utilities and functions defined in this standard; the utility and function descriptions point out additional functionality required to provide the corresponding specific features needed by each.

The following subclauses describe frequently used concepts. Utility and function description statements override these defaults when appropriate.

BEGIN_RATIONALE

2.9.1.0.1 Features Inherited from POSIX.1 Rationale. (This subclause is not a part of P1003.2)

It has been pointed out that POSIX.2 assumes that a lot of POSIX.1 {8} functionality is present, but never states exactly how much. This is an attempt to clarify the assumptions.

This subclause only covers the ``utilities and functions defined by this standard.'' It does not mandate that the specific POSIX.1 {8} interfaces themselves be available to all application programs. A C language program compiled on a POSIX.2 system is not guaranteed that any of the POSIX.1 {8} functions are accessible. (For example, although UNIX system-based implementations of ls will use stat() to get file status, a POSIX.2 implementation of ls on a ``LONG_NAME_OS-based'' implementation might use the get_file_attributes() and the get_file_time_stamps() system calls.) POSIX.2 only requires equivalent functionality, not equal means of access.
In any event, programs requiring the POSIX.1 {8} system interface should specify that they need POSIX.1 {8} conformance and not hope to achieve it by piggybacking on POSIX.2.

END_RATIONALE

2.9.1.1 Process Attributes

The following process attributes, as described in POSIX.1 {8}, are assumed to be supported for all processes in POSIX.2:

      controlling terminal           real group ID
      current working directory      real user ID
      effective group ID             root directory
      effective user ID              saved set-group-ID
      file descriptors               saved set-user-ID
      file mode creation mask        session membership
      process ID                     supplementary group IDs
      process group ID

A conforming implementation may include additional process attributes.

BEGIN_RATIONALE

2.9.1.1.1 Process Attributes Rationale. (This subclause is not a part of P1003.2)

The supplementary group IDs requirement is minimal. If {NGROUPS_MAX} is defined to be zero, they are not required. If {NGROUPS_MAX} is greater than zero, the supplementary group IDs are used as described in POSIX.1 {8} in various permission checking operations.

The saved-set-group-ID and saved-set-user-ID requirements are also minimal. If {_POSIX_SAVED_IDS} is defined, they are required; otherwise, they are not.

A controlling terminal is needed to control access to /dev/tty.

The file creation semantics of POSIX.2 require the effective group ID, effective user ID, and the file mode creation mask.

Pathname resolution and access permission checks require the current working directory, effective group ID, effective user ID, and root directory.
The kill utility requires the effective group ID, effective user ID, process ID, process group ID, real group ID, real user ID, saved set-group-ID, saved set-user-ID, and session membership attributes to perform the various signal addressing and permission checks.

The id utility is based on the effective group ID, effective user ID, real group ID, real user ID, and supplementary group IDs.

The following process attributes described in POSIX.1 {8} do not seem to be required by POSIX.2: parent process ID, pending signals, process signal mask, time left until an alarm clock signal, tms_cstime, tms_cutime, tms_stime, and tms_utime.

There are probably other attributes mentioned in POSIX.1 {8} that are not listed here.

END_RATIONALE

2.9.1.2 Concurrent Execution of Processes

The following functionality of the POSIX.1 {8} fork() function shall be available on all POSIX.2 conformant systems:

 (1) Independent processes shall be capable of executing independently without either process terminating.

 (2) A process shall be able to create a new process with all of the attributes referenced in 2.9.1.1, determined according to the semantics of a call to the POSIX.1 {8} fork() function followed by a call in the child process to one of the POSIX.1 {8} exec functions.

BEGIN_RATIONALE

2.9.1.2.1 Concurrent Execution of Processes Rationale. (This subclause is not a part of P1003.2)

The historical functionality of fork() is required, which permits the concurrent execution of independent processes. A system with a single thread of process execution is not an appropriate base upon which to build a POSIX.2 system. (This requirement was not explicitly stated in the 1988 POSIX.1, but is included in the current POSIX.1 {8}.)
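The fork-then-exec sequence referred to in 2.9.1.2 can be sketched as follows. This is an illustration under stated assumptions, not a required implementation: the helper name run_sh is hypothetical, and the location of the shell at /bin/sh is an assumption that this standard does not make.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command string through the shell in a new process and return
 * its exit status, or -1 if the process could not be created or did
 * not exit normally. */
int run_sh(const char *command)
{
    pid_t pid;
    int status;

    assert(command != NULL);

    pid = fork();
    if (pid < 0)
        return -1;                  /* fork() failed */

    if (pid == 0) {
        /* Child: attributes such as the current working directory and
         * the file mode creation mask carry over across the exec.
         * The /bin/sh pathname is an assumption of this sketch. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                 /* reached only if execl() failed */
    }

    /* Parent: keeps running independently, then collects the child. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A call such as run_sh("exit 3") returns the child's exit status while the calling process continues to exist, which is the independence that item (1) asks for.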
END_RATIONALE

2.9.1.3 File Access Permissions

The file access control mechanism described by file access permissions in 2.2.2.55 applies to all files on a conforming POSIX.2 implementation.

BEGIN_RATIONALE

2.9.1.3.1 File Access Permissions Rationale. (This subclause is not a part of P1003.2)

The entire concept of file protections and access control is assumed to be handled as in POSIX.1 {8}.

END_RATIONALE

2.9.1.4 File Read, Write, and Creation

When a file is to be read or written, the file shall be opened with an access mode corresponding to the operation to be performed. If file access permissions deny access, the requested operation shall fail.

When a file that does not exist is created, the following POSIX.1 {8} features shall apply unless the utility or function description states otherwise:

 (1) The file's user ID is set to the effective user ID of the calling process.

 (2) The file's group ID is set to the effective group ID of the calling process or the group ID of the directory in which the file is being created.

 (3) The file's permission bits are set to:

           S_IROTH | S_IWOTH | S_IRGRP | S_IWGRP | S_IRUSR | S_IWUSR

     (see POSIX.1 {8} 5.6.1.2) except that the bits specified by the process's file mode creation mask are cleared.

 (4) The st_atime, st_ctime, and st_mtime fields of the file shall be updated as specified in file times update in 2.2.2.69.

 (5) If the file is a directory, it shall be an empty directory; otherwise the file shall have length zero.

 (6) Unless otherwise specified, the file created shall be a regular file.
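The creation defaults listed above correspond closely to a single POSIX.1 open() call. The following is a minimal sketch, assuming the caller wants write access; the helper name create_default is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a regular file with the default permission bits of item (3),
 * or truncate it if it already exists. Returns a file descriptor open
 * for writing, or -1 on failure. */
int create_default(const char *path)
{
    /* Read and write for user, group, and other; the system clears
     * whatever bits the file mode creation mask specifies. */
    const mode_t mode = S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP |
                        S_IROTH | S_IWOTH;

    assert(path != NULL);

    /* O_CREAT applies the mode only when the file is new; O_TRUNC
     * reduces an existing regular file to length zero without
     * changing its owner, group, or permission bits. */
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);
}
```

With a file mode creation mask of 022, a newly created file ends up with permission bits 0644.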
When an attempt is made to create a file that already exists, the action shall depend on the file type:

 (1) For directories and FIFO special files, the attempt shall fail and the utility shall either continue with its operation or exit immediately with a nonzero status, depending on the description of the utility.

 (2) For regular files:

      (a) The file's user ID, group ID, and permission bits shall not be changed.

      (b) The file shall be truncated to zero length.

      (c) The st_ctime and st_mtime fields shall be marked for update.

 (3) For other file types, the effect is implementation defined.

When a file is to be appended, the file shall be opened in a manner equivalent to using the O_APPEND flag, without the O_TRUNC flag, in the POSIX.1 {8} open() call.

BEGIN_RATIONALE

2.9.1.4.1 File Read, Write, and Creation Rationale. (This subclause is not a part of P1003.2)

Even though it might be possible for a process to change the mode of a file to match a requested operation and change the mode back to its original state after the operation is completed, utilities are not allowed to do this unless the utility description states otherwise. As an example, the ed utility r command fails if the file to be read does not exist (even though it could create the file and then read it) or the file permissions do not allow read access [even though it could use the POSIX.1 {8} chmod() function to make the file readable before attempting to open the file].

END_RATIONALE

2.9.1.5 File Removal

When a directory that is the root directory or current working directory of any process is removed, the effect is implementation defined. If file access permissions deny access, the requested operation shall fail.
Otherwise, when a file is removed:

 (1) Its directory entry shall be removed from the file system.

 (2) The link count of the file shall be decremented.

 (3) If the file is an empty directory (see 2.2.2.43):

      (a) If no process has the directory open, the space occupied by the directory shall be freed and the directory shall no longer be accessible.

      (b) If one or more processes have the directory open, the directory contents shall be preserved until all references to the file have been closed.

 (4) If the file is a directory that is not empty, the st_ctime field shall be marked for update.

 (5) If the file is not a directory:

      (a) If the link count becomes zero:

           [1] If no process has the file open, the space occupied by the file shall be freed and the file shall no longer be accessible.

           [2] If one or more processes have the file open, the file contents shall be preserved until all references to the file have been closed.

      (b) If the link count is not reduced to zero, the st_ctime field shall be marked for update.

 (6) The st_ctime and st_mtime fields of the containing directory shall be marked for update.

BEGIN_RATIONALE

2.9.1.5.1 File Removal Rationale. (This subclause is not a part of P1003.2)

This is intended to be a summary of the POSIX.1 {8} unlink() and rmdir() requirements needed by POSIX.2.

END_RATIONALE

2.9.1.6 File Time Values

All files have the three time values described by file times update in 2.2.2.69.

BEGIN_RATIONALE

2.9.1.6.1 File Time Values Rationale. (This subclause is not a part of P1003.2)

All three time stamps specified by POSIX.1 {8} are needed for utilities like find, ls, make, test, and touch to work as expected.

END_RATIONALE
2.9.1.7 File Contents

When a reference is made to the contents of a file, pathname, this means the equivalent of all of the data placed in the space pointed to by buf when performing the read() function calls in the following POSIX.1 {8} operations:

      while (read (fildes, buf, nbytes) > 0)
              ;

If the file is indicated by a pathname pathname, the file descriptor shall be determined by the equivalent of the following POSIX.1 operation:

      fildes = open (pathname, O_RDONLY);

The value of nbytes in the above sequence is unspecified; if the file is of a type where the data returned by read() would vary with different values, the value shall be one that results in the most data being returned. If the read() function calls would return an error, it is unspecified whether the contents of the file are considered to include any data from offsets in the file beyond where the error would be returned.

BEGIN_RATIONALE

2.9.1.7.1 File Contents Rationale. (This subclause is not a part of P1003.2)

This description is intended to convey the traditional behavior for all types of files. This matches the intuitive meaning for regular files, but the meaning is not always intuitive for other types of files. In particular, for FIFOs, pipes, and terminals it must be clear that the contents are not necessarily static at the time a file is opened, but they include the data returned by a sequence of reads until end-of-file is indicated. This is why the open() call is specified, with the O_NONBLOCK flag not set.

Some files, especially character special files, are sensitive to the size of a read() request. The contents of the file are those resulting from proper choice of this size.

END_RATIONALE
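The definition of file contents in 2.9.1.7 can be turned into a small routine that collects every byte returned by the read loop. The sketch below is one possible reading, not a normative one: the helper name read_contents and the request size of 4096 bytes are assumptions (the standard leaves nbytes unspecified), and the routine simply stops at the first read() error.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return a malloc()ed copy of the contents of pathname and store the
 * byte count in *len. Returns NULL if the file cannot be opened or
 * memory runs out. The caller frees the result. */
char *read_contents(const char *pathname, size_t *len)
{
    char chunk[4096];               /* arbitrary choice of nbytes */
    char *data = NULL;
    size_t used = 0;
    ssize_t n;
    int fildes;

    assert(pathname != NULL && len != NULL);

    fildes = open(pathname, O_RDONLY);      /* O_NONBLOCK not set */
    if (fildes < 0)
        return NULL;

    /* Same loop as in the text, but keeping what each read() returns. */
    while ((n = read(fildes, chunk, sizeof chunk)) > 0) {
        char *grown = realloc(data, used + (size_t)n);
        if (grown == NULL) {
            free(data);
            close(fildes);
            return NULL;
        }
        data = grown;
        memcpy(data + used, chunk, (size_t)n);
        used += (size_t)n;
    }
    close(fildes);

    *len = used;
    if (data == NULL)
        data = malloc(1);           /* empty file: still a valid pointer */
    return data;
}
```

For a FIFO or terminal the loop blocks until end-of-file is indicated, which matches the rationale's point that the contents are not necessarily static when the file is opened.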
2.9.1.8 Pathname Resolution

The pathname resolution algorithm described by pathname resolution in 2.2.2.104 shall be used by conforming POSIX.2 implementations. See also file hierarchy in 2.2.2.58.

BEGIN_RATIONALE

2.9.1.8.1 Pathname Resolution Rationale. (This subclause is not a part of P1003.2)

The whole concept of hierarchical file systems and pathname resolution is assumed to be handled as in POSIX.1 {8}.

END_RATIONALE

2.9.1.9 Changing the Current Working Directory

When the current working directory (see 2.2.2.159) is to be changed, unless the utility or function description states otherwise, the operation shall succeed unless a call to the POSIX.1 {8} chdir() function would fail when invoked with the new working directory pathname as its argument.

2.9.1.9.1 Changing the Current Working Directory Rationale. (This subclause is not a part of P1003.2)

This subclause covers the access permissions and pathname structures involved with changing directories, such as with cd or (the UPE-extended) ma

[Table 2-3 (continued): the remaining rows are not recoverable in this copy.]

Each symbolic name specified in Table 2-3 shall be included in the file and shall be mapped to a unique encoding value (except for those symbolic names that are shown with identical glyphs). If the control characters commonly associated with the symbolic names in Table 2-4 are supported by the implementation, the symbolic names and their corresponding encoding values shall be included in the file. Some of the values associated with the symbolic names in this table also may be contained in Table 2-3.

Table 2-4 - Control Character Set

[Table contents are not recoverable in this copy.]

The following declarations can precede the character definitions. Each shall consist of the symbol shown in the following list, starting in column 1, including the surrounding brackets, followed by one or more <blank>s, followed by the value to be assigned to the symbol.

<code_set_name>   The name of the coded character set for which the character set description file is defined. The characters of the name shall be taken from the set of characters with visible glyphs defined in Table 2-3.

<mb_cur_max>      The maximum number of bytes in a multibyte character. This shall default to 1.

<mb_cur_min>      An unsigned positive integer value that shall define the minimum number of bytes in a character for the encoded character set.
                  The value shall be less than or equal to mb_cur_max. If not specified, the minimum number shall be equal to mb_cur_max.

<escape_char>     The escape character used to indicate that the characters following shall be interpreted in a special way, as defined later in this subclause. This shall default to backslash (\), which is the character glyph used in all the following text and examples, unless otherwise noted.

<comment_char>    The character that, when placed in column 1 of a charmap line, is used to indicate that the line shall be ignored. The default character shall be the number-sign (#).

The character set mapping definitions shall be all the lines immediately following an identifier line containing the string CHARMAP starting in column 1, and preceding a trailer line containing the string END CHARMAP starting in column 1. Empty lines and lines containing a comment_char in the first column shall be ignored. Each noncomment line of the character set mapping definition (i.e., between the CHARMAP and END CHARMAP lines of the file) shall be in either of two forms:

      "%s %s %s\n", <symbolic-name>, <encoding>, <comments>

or

      "%s...%s %s %s\n", <symbolic-name>, <symbolic-name>, <encoding>, <comments>

In the first format, the line in the character set mapping definition defines a single symbolic name and a corresponding encoding. A symbolic name is one or more characters from the set shown with visible glyphs in Table 2-3, enclosed between angle brackets. A character following an escape character shall be interpreted as itself; for example, the sequence ``<\\\>>'' represents the symbolic name ``\>'' enclosed between angle brackets.

In the second format, the line in the character set mapping definition defines a range of one or more symbolic names.
In this form, the symbolic names shall consist of zero or more nonnumeric characters from the set shown with visible glyphs in Table 2-3, followed by an integer formed by one or more decimal digits. The characters preceding the integer shall be identical in the two symbolic names, and the integer formed by the digits in the second symbolic name shall be equal to or greater than the integer formed by the digits in the first name. This shall be interpreted as a series of symbolic names formed from the common part and each of the integers between the first and the second integer, inclusive. As an example, <j0101>...<j0104> is interpreted as the symbolic names <j0101>, <j0102>, <j0103>, and <j0104>, in that order.

A character set mapping definition line shall exist for all symbolic names specified in Table 2-3, and shall define the coded character value that corresponds with the character glyph indicated in the table, or the coded character value that corresponds with the control character symbolic name. If the control characters commonly associated with the symbolic names in Table 2-4 are supported by the implementation, the symbolic name and the corresponding encoding value shall be included in the file. Additional unique symbolic names may be included. A coded character value can be represented by more than one symbolic name.
The encoding part shall be expressed as one (for single-byte character values) or more concatenated decimal, octal, or hexadecimal constants in the following formats:

      "%cd%d", <escape_char>, <decimal byte value>

      "%cx%x", <escape_char>, <hexadecimal byte value>

      "%c%o", <escape_char>, <octal byte value>

Decimal constants shall be represented by two or three decimal digits, preceded by the escape character and the lowercase letter d; for example, \d05, \d97, or \d143. Hexadecimal constants shall be represented by two hexadecimal digits, preceded by the escape character and the lowercase letter x; for example, \x05, \x61, or \x8f. Octal constants shall be represented by two or three octal digits, preceded by the escape character; for example, \05, \141, or \217. In a portable charmap file, each constant shall represent an 8-bit byte. Implementations supporting other byte sizes may allow constants to represent values larger than those that can be represented in 8-bit bytes, and may allow additional digits in constants. When constants are concatenated for multibyte character values, they shall be of the same type, and interpreted in byte order from left to right. The manner in which constants are represented in the character is implementation defined. Omitting bytes from a multibyte character definition produces undefined results.

In lines defining ranges of symbolic names, the encoded value is the value for the first symbolic name in the range (the symbolic name preceding the ellipsis). Subsequent symbolic names defined by the range shall have encoding values in increasing order. For example, the line

      <j0101>...<j0104>    \d129\d254

shall be interpreted as

      <j0101>    \d129\d254
      <j0102>    \d129\d255
      <j0103>    \d130\d0
      <j0104>    \d130\d1

The comment is optional.
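The range example, in which \d129\d255 is followed by \d130\d0, suggests that a multibyte encoding is advanced like a base-256 number with the leftmost byte most significant. The sketch below rests on that assumption, which is inferred from the example rather than stated in the text. The function names next_encoding and format_entry, the symbolic-name prefix, and the zero-padded width are hypothetical, and the decimal constants are printed without padding only to mirror the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Advance an n-byte encoding (most significant byte first) by one.
 * Returns 0 on success, or -1 if every byte was already 255. */
int next_encoding(unsigned char *bytes, size_t n)
{
    size_t i = n;

    assert(bytes != NULL);
    while (i > 0) {
        i--;
        if (bytes[i] != 255) {
            bytes[i]++;
            return 0;
        }
        bytes[i] = 0;               /* carry into the byte to the left */
    }
    return -1;
}

/* Write one mapping line, such as "<j0102> \d129\d255", into out.
 * Returns the length written, or -1 if out is too small. */
int format_entry(char *out, size_t outsz, const char *prefix, int width,
                 unsigned number, const unsigned char *bytes, size_t n)
{
    size_t i;
    int len = snprintf(out, outsz, "<%s%0*u>", prefix, width, number);

    if (len < 0 || (size_t)len >= outsz)
        return -1;
    for (i = 0; i < n; i++) {
        size_t room = outsz - (size_t)len;
        int k = snprintf(out + len, room, "%s\\d%u",
                         i == 0 ? " " : "", (unsigned)bytes[i]);
        if (k < 0 || (size_t)k >= room)
            return -1;
        len += k;
    }
    return len;
}
```

Starting from the bytes 129 and 254 and calling next_encoding() once per additional name yields the four encodings listed in the example.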
For the interpretation of the dollar-sign and the number-sign, see 2.2.2.37 and 2.2.2.93.

BEGIN_RATIONALE

2.4.2 Character Set Rationale. (This subclause is not a part of P1003.2)

The portable character set is listed in full so there is no dependency on the ISO/IEC 646 {1} (or historically ASCII) encoded character set, although the set is identical to the characters defined in the International Reference Version of ISO/IEC 646 {1}.

This standard poses no requirement that multiple character sets or code sets be supported, leaving this as a marketing differentiation for implementors. Although multiple charmap files are supported, it is the responsibility of the implementation to provide the file(s); if only one is provided, only that one will be accessible using the localedef utility's -f option (although in the case of just one file on the system, -f is not useful).

The statement about invariance in code sets for the portable character set is worded as it is to avoid precluding implementations where multiple incompatible code sets are available (say, ASCII and EBCDIC). The standard utilities cannot be expected to produce predictable results if they access portable characters that vary on the same implementation.

The character set description file provides:

 - the capability to describe character set attributes (such as collation order or character classes) independent of character set encoding, and using only the characters in the portable character set. This makes it possible to create ``generic'' localedef source files for all code sets that share the portable character set (such as the ISO 8859 family or IBM Extended ASCII).
 - standardized symbolic names for all characters in the portable character set, making it possible to refer to any such character regardless of encoding.

Implementations are free to describe more than one code set in a character set description file, as long as only one encoding exists for the characters in Table 2-3. For example, if an implementation defines ISO 8859-1 {5} as the primary code set, and ISO 8859-2 {6} as an alternate set, with each character from the alternate code set preceded in data by a shift code, a character set description file could contain a complete description of the primary set and those characters from the secondary that are not identical, the encoding of the latter including the shift code.

Implementations are free to choose their own symbolic names, as long as the names identified by this standard are also defined; this provides support for already existing ``character names.''

The names selected for the members of the portable character set follow the ISO 8859 {5} and the ISO/IEC 10646 {B11} standards. However, several commonly used UNIX system names occur as synonyms in the list:

 - The traditional UNIX system names are used for control characters.

 - The word ``slash'' is in addition to ``solidus.''

 - The word ``backslash'' is in addition to ``reverse-solidus.''

 - The word ``hyphen'' is in addition to ``hyphen-minus.''

 - The word ``period'' is in addition to ``full-stop.''

 - For the digits, the word ``digit'' is eliminated.

 - For letters, the words ``Latin Capital Letter'' and ``Latin Small Letter'' are eliminated.
 - The words ``left-brace'' and ``right-brace'' are in addition to ``left-curly-bracket'' and ``right-curly-bracket.''

 - The names of the digits are preferred over the numbers, to avoid possible confusion between ``0'' and ``O'', and between ``1'' and ``l'' (one and the letter ell).

The names for the control characters in Table 2-4 were taken from ISO 4873 {4}.

The charmap file was introduced to resolve problems with the portability of, especially, localedef sources. This standard assumes that the portable character set is constant across all locales, but does not prohibit implementations from supporting two incompatible codings, such as both ASCII and EBCDIC. Such ``dual-support'' implementations should have all charmaps and localedef sources encoded using one portable character set, in effect ``cross-compiling'' for the other environment. Naturally, charmaps (and localedef sources) are only portable without transformation between systems using the same encodings for the portable character set. They can, however, be transformed between two sets using only a subset of the actual characters (the portable set). However, the particular coded character set used for an application or an implementation does not necessarily imply different characteristics or collation: on the contrary, these attributes should in many cases be identical, regardless of code set. The charmap provides the capability to define a common locale definition for multiple code sets (the same localedef source can be used for code sets with different extended characters; the ability in the charmap to define ``empty'' names allows for characters missing in certain code sets).
In addition, several implementors have expressed an interest in using the charmap concept to provide the information required for support of multiple character sets. Examples of such information are encoding mechanism, string parsing rules, default font information, etc. Such extensions are not described here.

The <escape_char> declaration was added at the request of the international community to ease the creation of portable charmap files on terminals not implementing the default backslash escape. (This approach was adopted because this is a new interface invented by POSIX.2. Historical interfaces, such as the shell command language and awk, have not been modified to accommodate this type of terminal.)

The <comment_char> declaration was added at the request of the international community to eliminate the potential confusion between the number sign and the pound sign.

The octal number notation with no leading zero required was selected to match those of awk and tr and is consistent with that used by localedef. To avoid confusion between an octal constant and the backreferences used in localedef source, the octal, hexadecimal, and decimal constants must contain at least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Each of the constants includes ``two or more'' digits to account for systems in which the byte size is larger than eight bits. For example, a Unicode system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal digits.

The decimal notation is supported because some newer international standards define character values in decimal, rather than in the old column/row notation.

The charmap identifies the coded character sets supported by an implementation. At least one charmap must be provided, but no implementation is required to provide more than one.
Likewise, implementations can allow users to generate new charmaps (for instance for a new version of the 8859 family of coded character sets), but do not have to do so. If users are allowed to create new charmaps, the system documentation must describe the rules that apply (for instance: ``only coded character sets that are supersets of ISO/IEC 646 {1} IRV, no multibyte characters, etc.'')

END_RATIONALE

2.5 Locale

A locale is the definition of the subset of a user's environment that depends on language and cultural conventions. It is made up from one or more categories. Each category is identified by its name and controls specific aspects of the behavior of components of the system. Category names correspond to the following environment variable names:

      LC_CTYPE       Character classification and case conversion.
      LC_COLLATE     Collation order.
      LC_TIME        Date and time formats.
      LC_NUMERIC     Numeric, nonmonetary formatting.
      LC_MONETARY    Monetary formatting.
      LC_MESSAGES    Formats of informative and diagnostic messages and interactive responses.

Conforming implementations shall provide the standard utilities and the interfaces in Annex B (if that option is supported) with the capability to modify their behavior based on the current locale, as defined in the Environment Variables subclause for each utility and interface.

Locales other than those supplied by the implementation can be created via the localedef utility (see 4.35), provided that the {POSIX2_LOCALEDEF} symbol is defined on the system; see 2.13.2. Otherwise, only the implementation-provided locale(s) can be used. The input to the utility is described in 2.5.2.
The value that shall be used to specify a locale when using environment variables shall be the string specified as the name operand to the localedef utility when the locale was created. The strings "C" and "POSIX" are reserved as identifiers for the POSIX Locale (see 2.5.1). When the value of a locale environment variable begins with a slash (/), it shall be interpreted as the pathname of the locale definition. If the value does not begin with a slash, the mechanism used to locate the locale is implementation defined.

If different character sets are used by the locale categories, the results achieved by an application utilizing these categories are undefined. Likewise, if different code sets are used for the data being processed by interfaces whose behavior is dependent on the current locale, or the code set is different from the code set assumed when the locale was created, the result is also undefined.

BEGIN_RATIONALE

2.5.0.1 Locale Rationale. (This subclause is not a part of P1003.2)

The description of locales is based on work performed in the UniForum Technical Committee Subcommittee on Internationalization. Wherever appropriate, keywords were taken from the C Standard {7} or the X/Open Portability Guide {B31}.

The value that shall be used to specify a locale when using environment variables is the name specified as the name operand to the localedef utility when the locale was created. This provides a verifiable method to create and invoke a locale.

The ``object'' definitions need not be portable, as long as ``source'' definitions are. Strictly speaking, ``source'' definitions are portable only between implementations using the same character set(s).
Such ``source'' definitions can, if they use symbolic names only, easily be ported between systems using different code sets as long as the characters in the portable character set (Table 2-3) have common values between the code sets; this is frequently the case in historical implementations. Of course, this requires that the symbolic names used for characters outside the portable character set are identical between character sets. The definition of symbolic names for characters is outside the scope of this standard, but is certainly within the scope of other standards organizations. When such names are standardized, future versions of POSIX.2 should require the use of these names.

Applications can select the desired locale by invoking the setlocale() function (or equivalent) with the appropriate value. If the function is invoked with an empty string, the value of the corresponding environment variable is used. If the environment variable is unset or is set to the empty string, the implementation sets the appropriate environment as defined in 2.6.

END_RATIONALE

2.5.1 POSIX Locale

Conforming implementations shall provide a POSIX Locale. The behavior of standard utilities in the POSIX Locale shall be as if the locale was defined via the localedef utility with input data from Table 2-5, Table 2-7, Table 2-9, Table 2-10, Table 2-8, and Table 2-11, all in 2.5.2.

The tables describe the characteristics and behavior of the POSIX Locale for data consisting entirely of characters from the portable character set in Table 2-3 and the control characters in Table 2-4. For characters other than those in the two tables, the behavior is unspecified.

The POSIX Locale can be specified by assigning the appropriate environment variables the values "C" or "POSIX".
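From a C program, the counterpart of setting the environment variables is the setlocale() function mentioned in the rationale of 2.5. The sketch below assumes a C library that accepts the name "POSIX" as well as "C"; the helper names use_posix_locale and current_decimal_point are hypothetical.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Select the POSIX Locale for every category.
 * Returns 1 on success, 0 if the request could not be honored. */
int use_posix_locale(void)
{
    return setlocale(LC_ALL, "POSIX") != NULL;
}

/* Return the radix character string of the current LC_NUMERIC
 * category, as reported by localeconv(). */
const char *current_decimal_point(void)
{
    struct lconv *lc = localeconv();

    assert(lc != NULL);
    return lc->decimal_point;
}
```

Calling setlocale(LC_ALL, "") instead would take the locale from the environment variables. After use_posix_locale() succeeds, current_decimal_point() reports a period, as the C Standard requires for the "C" locale.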
Table 2-5 shows the definition for the LC_CTYPE category. Table 2-7 shows the definition for the LC_COLLATE category. Table 2-8 shows the definition for the LC_MONETARY category. Table 2-9 shows the definition for the LC_NUMERIC category. Table 2-10 shows the definition for the LC_TIME category. Table 2-11 shows the definition for the LC_MESSAGES category.

BEGIN_RATIONALE

2.5.1.1 POSIX Locale Rationale. (This subclause is not a part of P1003.2)

The POSIX Locale is equal to the "C" locale, as specified in POSIX.1 {8}. To avoid being classified as a C-language function, the name has been changed to the POSIX Locale; the environment variable value can be either "POSIX", or, for historical reasons, "C".

The POSIX definitions mirror the historical UNIX system behavior.

The use of symbolic names for characters in the tables does not imply that the POSIX Locale must be described using symbolic character names, but merely that it may be advantageous to do so.

Implementations must define a locale as the ``default'' locale, to be invoked when no environment variables are set, or set to the empty string. This default locale can be the POSIX Locale or any other, implementation-defined locale. Some implementations may provide facilities for local installation administrators to set the default locale, customizing it for each location. This standard does not require such a facility.

END_RATIONALE

2.5.2 Locale Definition

The capability to specify additional locales to those provided by an implementation is optional (see 2.13.2). If the option is not supported, only implementation-supplied locales are available. Such locales shall be documented using the format specified in this clause.

Locales can be described with the file format presented in this subclause.
The file format is that accepted by the localedef utility (see 4.35). For the purposes of this subclause, the file is referred to as the locale definition file, but no locales shall be affected by this file unless it is processed by localedef or some similar mechanism. Any requirements in this subclause imposed upon ``the utility'' shall apply to localedef or to any other similar utility used to install locale information using the locale definition file format described here.

The locale definition file shall contain one or more locale category source definitions, and shall not contain more than one definition for the same locale category. If the file contains source definitions for more than one category, implementation-defined categories, if present, shall appear after the categories defined by this clause (2.5). A category source definition shall contain either the definition of a category or a copy directive. For a description of the copy directive, see 4.35. In the event that some of the information for a locale category, as specified in this standard, is missing from the locale source definition, the behavior of that category, if it is referenced, is unspecified.

A category source definition shall consist of a category header, a category body, and a category trailer. A category header shall consist of the character string naming the category, beginning with the characters LC_. The category trailer shall consist of the string END, followed by one or more <blank>s and the string used in the corresponding category header.

The category body shall consist of one or more lines of text. Each line shall contain an identifier, optionally followed by one or more operands. Identifiers shall be either keywords, identifying a particular locale element, or collating elements. In addition to the keywords defined in this standard, the source can contain implementation-defined keywords.
Each keyword within a locale shall have a unique name (i.e., two categories cannot have a commonly-named keyword); no keyword shall start with the characters LC_. Identifiers shall be separated from the operands by one or more <blank>s.

Operands shall be characters, collating elements, or strings of characters. Strings shall be enclosed in double-quotes. Literal double-quotes within strings shall be preceded by the <escape character>, described below. When a keyword is followed by more than one operand, the operands shall be separated by semicolons; <blank>s shall be allowed before and/or after a semicolon.

The first category header in the file can be preceded by a line modifying the comment character. It shall have the following format, starting in column 1:

      "comment_char %c\n", <comment character>

The comment character shall default to the number-sign (#). Blank lines and lines containing the <comment char> in the first position shall be ignored.

The first category header in the file can be preceded by a line modifying the escape character to be used in the file. It shall have the following format, starting in column 1:

      "escape_char %c\n", <escape character>

The escape character shall default to backslash, which is the character used in all examples shown in this standard.

A line can be continued by placing an escape character as the last character on the line; this continuation character shall be discarded from the input. Although the implementation need not accept any one portion of a continued line with a length exceeding {LINE_MAX} bytes, it shall place no limits on the accumulated length of the continued line. Comment lines shall not be continued on a subsequent line using an escaped <newline>.
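
The line-level rules above (comment_char, escape_char, ignored lines, continuation, and the header and END trailer) can be sketched as follows. This is a minimal illustration under stated assumptions, not a localedef implementation: the function names logical_lines and categories, the category LC_EXAMPLE, and its keywords are all invented for the example, and the sketch accepts the two directives anywhere in the input rather than only before the first category header.

```python
def logical_lines(text):
    """Yield the non-comment lines of a locale definition file, with
    continued lines joined. Illustrative only, not a full parser."""
    comment_char = "#"    # default comment character
    escape_char = "\\"    # default escape character
    pending = ""          # pieces of a continued line collected so far
    for raw in text.split("\n"):
        if not pending:
            # Blank lines and comment lines are ignored; a comment line is
            # never treated as continued, even if it ends with the escape.
            if raw.strip() == "" or raw.startswith(comment_char):
                continue
            fields = raw.split()
            # Simplification: directives are honored wherever they appear.
            if len(fields) == 2 and fields[0] == "comment_char":
                comment_char = fields[1]
                continue
            if len(fields) == 2 and fields[0] == "escape_char":
                escape_char = fields[1]
                continue
        if raw.endswith(escape_char):
            # Discard the continuation character and keep accumulating.
            pending += raw[:-len(escape_char)]
            continue
        yield pending + raw
        pending = ""
    if pending:
        yield pending


def categories(text):
    """Group body lines under their category name (header ... END trailer)."""
    result = {}
    current = None
    for line in logical_lines(text):
        fields = line.split(None, 1)
        if not fields:
            continue
        if current is None:
            if fields[0].startswith("LC_"):
                current = fields[0]          # category header
                result[current] = []
        elif (fields[0] == "END" and len(fields) == 2
              and fields[1].strip() == current):
            current = None                   # category trailer
        else:
            result[current].append(line.strip())
    return result


# Made-up input: LC_EXAMPLE and its keywords are hypothetical.
sample = "\n".join([
    "comment_char %",
    "escape_char /",
    "% made-up input for illustration",
    "LC_EXAMPLE",
    'first_keyword "text"',
    "second_keyword 1;/",
    "2",
    "END LC_EXAMPLE",
])
print(categories(sample))
```

The sketch does not check for duplicate categories, enforce {LINE_MAX}, or handle an escaped escape character at the end of a line; a conforming utility would need to address all three.
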
Individual characters, characters in strings, and collating elements shall be represented using symbolic names, as defined below. In addition, characters can be represented using the characters themselves, or as octal, hexadecimal, or decimal constants. When nonsymbolic notation is used, the resultant locale definitions need not be portable between systems. The left angle bracket (<) is a reserved symbol, denoting the start of a symbolic name; when used to represent itself it shall be preceded by the escape character. The following rules apply to character representation:

 (1)  A character can be represented via a symbolic name, enclosed within angle brackets (< and >). The symbolic name, including the angle brackets, shall exactly match a symbolic name defined in the charmap file specified via the localedef -f option, and shall be replaced by a character value determined from the value associated with the symbolic name in the charmap file. The use of a symbolic name not found in the charmap file shall constitute an error, unless the category is LC_CTYPE or LC_COLLATE, in which case it shall constitute a warning condition (see localedef in 4.35 for a description of action resulting from errors and warnings). The specification of a symbolic name in a collating-element or collating-symbol clause that duplicates a symbolic name in the charmap file (if present) is an error. Use of the escape character or a right angle bracket within a symbolic name shall be invalid unless the character is preceded by the escape character.

      Example: <c>;<c-cedilla>   "<M><a><y>"

 (2)  A character can be represented by the character itself, in which case the value of the character is implementation defined.
      Within a string, the double-quote character, the escape character, and the right angle bracket character shall be escaped (preceded by the escape character) to be interpreted as the character itself. Outside strings, the characters

            , ; < > escape_char

      shall be escaped to be interpreted as the character itself.

      Example: c B "May"

 (3)  A character can be represented as an octal constant. An octal constant shall be specified as the escape character followed by two or more octal digits. Each constant shall represent a byte value. Multibyte characters can be represented by concatenated constants.

      Example: \143;\347;\143\150   "\115\141\171"

 (4)  A character can be represented as a hexadecimal constant. A hexadecimal constant shall be specified as the escape character followed by an x followed by two or more hexadecimal digits. Each constant shall represent a byte value. Multibyte characters can be represented by concatenated constants.

      Example: \x63;\xe7;\x63\x68   "\x4d\x61\x79"

 (5)  A character can be represented as a decimal constant. A decimal constant shall be specified as the escape character followed by a d followed by two or more decimal digits. Each constant shall represent a byte value. Multibyte values can be represented by concatenated constants.

      Example: \d99;\d231;\d99\d104   "\d77\d97\d121"

Implementations may accept single-digit octal, decimal, or hexadecimal constants following the escape character. Only characters existing in the character set for which the locale definition is created shall be specified, whether using symbolic names, the characters themselves, or octal, decimal, or hexadecimal constants. If a charmap file is present, only characters defined in the charmap can be specified using octal, decimal, or hexadecimal constants. Symbolic names not present in the charmap file can be specified and shall be ignored, as specified under item (1) above.
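
The numeric notations in items (3) through (5) can be decoded mechanically. The sketch below is illustrative only: the function name decode_constants is invented here, bytes are assumed to be 8 bits wide, and only the two-or-more-digit forms are accepted (the optional single-digit forms are rejected).

```python
import re

def decode_constants(operand, escape_char="\\"):
    """Convert concatenated octal, hexadecimal, or decimal constants to bytes.

    Assumes 8-bit bytes and the two-or-more-digit forms only.
    """
    esc = re.escape(escape_char)
    pattern = re.compile(
        esc + r"(?:x(?P<hex>[0-9A-Fa-f]{2,})"   # escape, x, hex digits
              r"|d(?P<dec>[0-9]{2,})"           # escape, d, decimal digits
              r"|(?P<oct>[0-7]{2,}))"           # escape, octal digits
    )
    out = bytearray()
    pos = 0
    while pos < len(operand):
        match = pattern.match(operand, pos)
        if match is None:
            raise ValueError("not a character constant at offset %d" % pos)
        if match.group("hex") is not None:
            value = int(match.group("hex"), 16)
        elif match.group("dec") is not None:
            value = int(match.group("dec"), 10)
        else:
            value = int(match.group("oct"), 8)
        if value > 0xFF:
            raise ValueError("value %d does not fit in an 8-bit byte" % value)
        out.append(value)    # each constant represents one byte value
        pos = match.end()
    return bytes(out)

print(decode_constants(r"\143\150"))        # b'ch'
print(decode_constants(r"\x4d\x61\x79"))    # b'May'
print(decode_constants(r"\d77\d97\d121"))   # b'May'
```

Concatenated constants yield a multibyte sequence, which is how the \143\150 example above denotes a two-byte collating element. How the resulting bytes map to characters depends on the code set in use, which this sketch does not model.
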
BEGIN_RATIONALE

2.5.2.0.1 Locale Definition Rationale. (This subclause is not a part of P1003.2)

The decision to separate the file format from the localedef utility description was only partially editorial. Implementations may provide other interfaces than localedef. Requirements on ``the utility,'' mostly concerning error messages, are described in this way because they are meant to affect the other interfaces implementations may provide as well as localedef. (This is similar to the philosophy used by POSIX.1 {8} where the descriptions of the tar and cpio file formats impose requirements on any utilities processing them.)

The text about {POSIX2_LOCALEDEF} does not mean that internationalization is optional; only that the functionality of the localedef utility is. Regular expressions, for instance, must still be able to recognize character class expressions such as [[:alpha:]]. A possible analogy is with an applications development environment: while all conforming implementations must be capable of executing applications, not all need to have the development environment installed. The assumption is that the capability to modify the behavior of utilities (and applications) via locale settings must be supported. If the localedef utility is not present, then the only choice is to select an existing (presumably implementation-documented) locale. An implementation could, for example, choose to support only the POSIX Locale, which would in effect limit the number of changes from historical implementations quite drastically. The localedef utility is still required, but would always terminate with an exit code indicating that no locale could be created. Supported locales must be documented using the syntax defined in 2.5.
(This ensures that users can accurately determine what capabilities are provided. If the implementation decides to provide additional capabilities to the ones in 2.5, that is already provided for.)

If the option is present (i.e., locales can be created), then the localedef utility must be capable of creating locales based on the syntax and rules defined in 2.5. This does not mean that the implementation cannot also provide alternate means for creating locales.

The octal, decimal, and hexadecimal notations are the same as those employed by the charmap facility (see 2.4.1). To avoid confusion between an octal constant and a backreference, the octal, hexadecimal, and decimal constants must contain at least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Each of the constants includes ``two or more'' digits to account for systems in which the byte size is larger than eight bits. For example, a Unicode system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal digits.

This standard is intended as an international (ISO/IEC) standard as well as an IEEE standard, and must therefore follow the ISO/IEC guidelines. One such rule is that characters outside the invariant part of ISO/IEC 646 {1} should not be used in portable specifications. The backslash character is not in the invariant part; the number-sign is, but with multiple representations: as a number-sign and as a pound sign. As far as general usage of these symbols is concerned, they are covered by the ``grandfather clause,'' but for newly defined interfaces, ISO has requested that POSIX provide alternate representations.
Consequently, while the default escape character remains the backslash, and the default comment character is the number-sign, implementations are required to recognize alternative representations, identified in the applicable source file via the escape_char and comment_char keywords.

END_RATIONALE

2.5.2.1 LC_CTYPE

Table 2-5 - LC_CTYPE Category Definition in the POSIX Locale
________________________________________________________________________

LC_CTYPE
# The following is the POSIX Locale LC_CTYPE.
# "alpha" is by default "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by default "alnum", "punct" and the <space> character
# "graph" is by default "alnum" and "punct"
#
upper   <A>;<B>;<C>;<D>;<E>;<F>;<G>;<H>;<I>;<J>;<K>;<L>;<M>;\
        <N>;<O>;