1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
|
Independent Submission J. Kunze
Request for Comments: 8493 California Digital Library
Category: Informational J. Littman
ISSN: 2070-1721 Stanford Libraries
E. Madden
Library of Congress
J. Scancella
C. Adams
Library of Congress
October 2018
The BagIt File Packaging Format (V1.0)
Abstract
This document describes BagIt, a set of hierarchical file layout
conventions for storage and transfer of arbitrary digital content. A
"bag" has just enough structure to enclose descriptive metadata
"tags" and a file "payload" but does not require knowledge of the
payload's internal semantics. This BagIt format is suitable for
reliable storage and transfer.
Status of This Memo
This document is not an Internet Standards Track specification; it is
published for informational purposes.
This is a contribution to the RFC Series, independently of any other
RFC stream. The RFC Editor has chosen to publish this document at
its discretion and makes no statement about its value for
implementation or deployment. Documents approved for publication by
the RFC Editor are not candidates for any level of Internet Standard;
see Section 2 of RFC 7841.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
https://www.rfc-editor.org/info/rfc8493.
Kunze, et al. Informational [Page 1]
^L
RFC 8493 BagIt October 2018
Copyright Notice
Copyright (c) 2018 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Kunze, et al. Informational [Page 2]
^L
RFC 8493 BagIt October 2018
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2. Requirements . . . . . . . . . . . . . . . . . . . . . . 4
1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . 5
2. Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1. Required Elements . . . . . . . . . . . . . . . . . . . . 6
2.1.1. Bag Declaration: bagit.txt . . . . . . . . . . . . . 6
2.1.2. Payload Directory: data/ . . . . . . . . . . . . . . 7
2.1.3. Payload Manifest: manifest-algorithm.txt . . . . . . 7
2.2. Optional Elements . . . . . . . . . . . . . . . . . . . . 8
2.2.1. Tag Manifest: tagmanifest-algorithm.txt . . . . . . . 8
2.2.2. Bag Metadata: bag-info.txt . . . . . . . . . . . . . 9
2.2.3. Fetch File: fetch.txt . . . . . . . . . . . . . . . . 12
2.2.4. Other Tag Files . . . . . . . . . . . . . . . . . . . 12
2.3. Text Tag File Format . . . . . . . . . . . . . . . . . . 13
2.4. Bag Checksum Algorithms . . . . . . . . . . . . . . . . . 13
3. Complete and Valid Bags . . . . . . . . . . . . . . . . . . . 14
4. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1. Example of a Basic Bag . . . . . . . . . . . . . . . . . 15
4.2. Example Bag Using fetch.txt . . . . . . . . . . . . . . . 16
5. Security Considerations . . . . . . . . . . . . . . . . . . . 16
5.1. Special Directory Characters . . . . . . . . . . . . . . 16
5.2. Control of URLs in fetch.txt . . . . . . . . . . . . . . 17
5.3. File Sizes in fetch.txt . . . . . . . . . . . . . . . . . 17
5.4. Attacks on Payload File Content . . . . . . . . . . . . . 17
6. Practical Considerations (Non-normative) . . . . . . . . . . 17
6.1. Interoperability . . . . . . . . . . . . . . . . . . . . 17
6.1.1. Filename Normalization . . . . . . . . . . . . . . . 18
6.1.2. Windows and Unix File Naming . . . . . . . . . . . . 18
6.1.3. Legacy Checksum Tools . . . . . . . . . . . . . . . . 18
7. Augmented Backus-Naur Form (Non-normative) . . . . . . . . . 21
7.1. Bag Declaration: bagit.txt . . . . . . . . . . . . . . . 21
7.2. Payload Manifest: manifest-algorithm.txt . . . . . . . . 21
7.3. Bag Metadata: bag-info.txt . . . . . . . . . . . . . . . 22
7.4. Fetch File: fetch.txt . . . . . . . . . . . . . . . . . . 22
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 22
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 22
9.1. Normative References . . . . . . . . . . . . . . . . . . 22
9.2. Informative References . . . . . . . . . . . . . . . . . 23
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . 24
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 25
Kunze, et al. Informational [Page 3]
^L
RFC 8493 BagIt October 2018
1. Introduction
1.1. Purpose
BagIt is a set of hierarchical file layout conventions designed to
support storage and transfer of arbitrary digital content. A "bag"
consists of a directory containing the payload files and other
accompanying metadata files known as "tag" files. The "tags" are
metadata files intended to facilitate and document the storage and
transfer of the bag. Processing a bag does not require any
understanding of the payload file contents, and the payload files can
be accessed without processing the BagIt metadata.
The name, BagIt, is inspired by the "enclose and deposit" method
[ENCDEP], sometimes referred to as "bag it and tag it". BagIt
differs from serialized archival formats such as MIME, TAR, or ZIP in
two general areas:
1. Strong integrity assurances. The format supports cryptographic-
quality hash algorithms (see Section 2.4) and allows for in-place
upgrades to add additional manifests using stronger algorithms
without breaking backwards compatibility. This provides high
levels of confidence against data corruption, but it is not
designed to be secure against active attacks.
2. Direct file access. Because BagIt specifies an actual filesystem
hierarchy rather than a serialized representation of one, files
can be accessed using standard operating system utilities,
implementations do not need to process a potentially large
archival file to extract a subset of data, and the format imposes
no size limits for either individual files or a bag.
BagIt is widely used for preserving digital assets originating from
different domains. Organizations involved in digital preservation
with BagIt include the Library of Congress, Dryad Data Repository,
NSF DataONE, and the Rockefeller Archive Center. Software
implementations are available for many languages, including Python,
Ruby, Java, Perl, and PHP. It is also used in the libraries of many
universities, such as Cornell, Purdue, Stanford, Ghent University,
New York University, and the University of California.
1.2. Requirements
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown here.
Kunze, et al. Informational [Page 4]
^L
RFC 8493 BagIt October 2018
Implementers are strongly encouraged to review the interoperability
considerations described in Section 6.1.
1.3. Terminology
The following terms have precise definitions as used in this
document:
bag: A set of opaque files contained within the structure defined by
this document.
bag declaration: The file required to be in all bags conforming to
this document. Contains values necessary to process the rest of a
bag. See Section 2.1.1.
bag checksum algorithm: The name of a cryptographic checksum
algorithm that has been normalized for use in a manifest or tag
manifest file name (e.g., "sha512") as described in Section 2.4.
manifest: A tag file that maps filepaths to checksums. A manifest
can be a payload manifest (see Section 2.1.3) or a tag manifest
(see Section 2.2.1).
payload: The data encapsulated by the bag as a set of named files,
which may be organized in subdirectories. The contents of the
payload files are opaque to this document, and, with respect to
BagIt processing, are always considered as sequences of
uninterpreted octets. See Section 2.1.2.
tag directory: A directory that contains one or more tag files.
tag file: A file that contains metadata about the bag or its
payload. This document defines the standard BagIt tag files: the
bag declaration in "bagit.txt" (see Section 2.1.1), payload
manifests (see Section 2.1.3), tag manifests (see Section 2.2.1),
bag metadata in "bag-info.txt" (see Section 2.2.2), and remote
payload in "fetch.txt" (see Section 2.2.3). This document also
allows other arbitrary tag files as described in Section 2.2.4.
complete: A bag that contains every element required by this
document, every payload file listed in a manifest, and any
optional files that are listed in a tag manifest. See Section 3.
valid: A complete bag where every checksum in every manifest has
been successfully verified against the corresponding file.
Kunze, et al. Informational [Page 5]
^L
RFC 8493 BagIt October 2018
2. Structure
A bag MUST consist of a base directory containing the following:
1. a set of required and optional tag files (see Section 2.2);
2. a subdirectory named "data", called the payload directory (see
Section 2.1.2); and
3. a set of optional tag directories.
The tag files in the base directory consist of one or more files
named "manifest-_algorithm_.txt" (see Sections 2.1.3 and 2.4), a file
named "bagit.txt" (see Section 2.1.1), and zero or more additional
tag files (see Section 2.2). The tag files and directories are in
arbitrary file hierarchies and MAY have any name that is not reserved
for a file or directory in this document.
The base directory can have any name, as illustrated by the figure
below.
<base directory>/
|
+-- bagit.txt
|
+-- manifest-<algorithm>.txt
|
+-- [additional tag files]
|
+-- data/
| |
| +-- [payload files]
|
+-- [tag directories]/
|
+-- [tag files]
2.1. Required Elements
2.1.1. Bag Declaration: bagit.txt
The "bagit.txt" tag file MUST consist of exactly two lines in this
order:
BagIt-Version: M.N
Tag-File-Character-Encoding: ENCODING
Kunze, et al. Informational [Page 6]
^L
RFC 8493 BagIt October 2018
_M.N_ identifies the BagIt major (M) and minor (N) version numbers.
_ENCODING_ identifies the character set encoding used by the
remaining tag files. _ENCODING_ SHOULD be "UTF-8", but for backwards
compatibility it MAY be any other encoding registered in
[cs-registry]. The bag declaration itself MUST be encoded in UTF-8
and MUST NOT contain a Byte Order Mark (BOM) [RFC3629].
The number for this version of BagIt is "1.0".
2.1.2. Payload Directory: data/
The base directory MUST contain a subdirectory named "data".
The payload directory contains the arbitrary digital content within
the bag. The files under the payload directory are called payload
files, or the payload. Each payload file is treated as an opaque
octet stream when verifying file correctness. Payload files MAY be
organized in arbitrary subdirectory structures within the payload
directory; however, for the purpose of this document, such
subdirectory structures and filenames have no given meaning.
2.1.3. Payload Manifest: manifest-algorithm.txt
A payload manifest file provides a complete listing of each payload
file name along with a corresponding checksum to permit data
integrity checking. A bag can have more than one payload manifest,
with each using a different checksum algorithm. Manifest entries
MUST satisfy the following constraints:
o Every bag MUST contain at least one payload manifest file and MAY
contain more than one.
o Every payload manifest MUST list every payload file name exactly
once.
o A payload manifest file MUST have a name of the form "manifest-
_algorithm_.txt", where _algorithm_ is a string specifying the
checksum algorithm used by that manifest as described in
Section 2.4.
Example payload manifest filenames:
manifest-sha256.txt
manifest-sha512.txt
Kunze, et al. Informational [Page 7]
^L
RFC 8493 BagIt October 2018
Each line of a payload manifest file MUST be of the form
checksum filepath
where _filepath_ is the pathname of a file relative to the base
directory, and _checksum_ is a hex-encoded checksum calculated by
applying _algorithm_ over the file.
o The hex-encoded checksum MAY use uppercase and/or lowercase
letters.
o The slash character ('/') MUST be used as a path separator in
_filepath_.
o One or more linear whitespace characters (spaces or tabs) MUST
separate _checksum_ from _filepath_.
o There is no limitation on the length of a pathname.
o The payload manifest MUST NOT reference files outside the payload
directory.
o If a _filepath_ includes a Line Feed (LF), a Carriage Return (CR),
a Carriage-Return Line Feed (CRLF), or a percent sign (%), those
characters (and only those) MUST be percent-encoded following
[RFC3986].
A manifest MUST NOT reference directories. Bag creators who wish to
create an otherwise empty directory have typically done so by
creating an empty placeholder file with a name such as ".keep".
2.2. Optional Elements
2.2.1. Tag Manifest: tagmanifest-algorithm.txt
A tag manifest is a tag file that lists other tag files and checksums
for those tag files generated using a particular bag checksum
algorithm.
A bag MAY contain one or more tag manifests, in which case each tag
manifest SHOULD list the same set of tag files.
Each tag manifest MUST list every payload manifest. Each tag
manifest MUST NOT list any tag manifests but SHOULD list the
remaining tag files present in the bag.
Kunze, et al. Informational [Page 8]
^L
RFC 8493 BagIt October 2018
A tag manifest file MUST have a name of the form "tagmanifest-
_algorithm_.txt", where _algorithm_ is a string following the format
described in Section 2.4 that specifies the bag checksum algorithm
used in that manifest.
Tag manifests SHOULD use the same algorithms as the payload manifests
that are present in the bag.
Example tag manifest filenames:
tagmanifest-sha256.txt
tagmanifest-sha512.txt
A tag manifest file has the same form as the payload manifest file
described in Section 2.1.3 but MUST NOT list any payload files. As a
result, no _filepath_ listed in a tag manifest begins "data/".
2.2.2. Bag Metadata: bag-info.txt
The "bag-info.txt" file is a tag file that contains metadata elements
describing the bag and the payload. The metadata elements contained
in the "bag-info.txt" file are intended primarily for human use. All
metadata elements are OPTIONAL and MAY be repeated. Because "bag-
info.txt" is intended for human reading and editing, ordering MAY be
significant and the ordering of metadata elements MUST be preserved.
A metadata element MUST consist of a label, a colon ":", a single
linear whitespace character (space or tab), and a value that is
terminated with an LF, a CR, or a CRLF.
The label MUST NOT contain a colon (:), LF, or CR. The label MAY
contain linear whitespace characters but MUST NOT start or end with
whitespace.
It is RECOMMENDED that lines not exceed 79 characters in length.
Long values MAY be continued onto the next line by inserting a LF,
CR, or CRLF, and then indenting the next line with one or more linear
white space characters (spaces or tabs). Except for linebreaks, such
padding does not form part of the value.
Implementations wishing to support previous BagIt versions MUST
accept multiple linear whitespace characters before and after the
colon when the bag version is earlier than 1.0; such whitespace does
not form part of the label or value.
Kunze, et al. Informational [Page 9]
^L
RFC 8493 BagIt October 2018
The following are reserved metadata elements. The use of these
reserved metadata elements is OPTIONAL but encouraged. Reserved
metadata element names are case insensitive. Except where indicated
otherwise, these metadata element names MAY be repeated to capture
multiple values.
Source-Organization: Organization transferring the content.
Organization-Address: Mailing address of the source organization.
Contact-Name: Person at the source organization who is responsible
for the content transfer.
Contact-Phone: International format telephone number of person or
position responsible.
Contact-Email: Fully qualified email address of person or position
responsible.
External-Description: A brief explanation of the contents and
provenance.
Bagging-Date: Date (YYYY-MM-DD) that the content was prepared for
transfer. This metadata element SHOULD NOT be repeated.
External-Identifier: A sender-supplied identifier for the bag.
Bag-Size: The size or approximate size of the bag being transferred,
followed by an abbreviation such as MB (megabytes), GB
(gigabytes), or TB (terabytes): for example, 42600 MB, 42.6 GB, or
.043 TB. Compared to Payload-Oxum (described next), Bag-Size is
intended for human consumption. This metadata element SHOULD NOT
be repeated.
Payload-Oxum: The "octetstream sum" of the payload, which is
intended for the purpose of quickly detecting incomplete bags
before performing checksum validation. This is strictly an
optimization, and implementations MUST perform the standard
checksum validation process before proclaiming a bag to be valid.
This element MUST NOT be present more than once and, if present,
MUST be in the form "_OctetCount_._StreamCount_", where
_OctetCount_ is the total number of octets (8-bit bytes) across
all payload file content and _StreamCount_ is the total number of
payload files. This metadata element MUST NOT be repeated.
Kunze, et al. Informational [Page 10]
^L
RFC 8493 BagIt October 2018
Bag-Group-Identifier: A sender-supplied identifier for the set, if
any, of bags to which it logically belongs. This identifier
SHOULD be unique across the sender's content, and if it is
recognizable as belonging to a globally unique scheme, the
receiver SHOULD make an effort to honor the reference to it. This
metadata element SHOULD NOT be repeated.
Bag-Count: Two numbers separated by "of", in particular, "N of T",
where T is the total number of bags in a group of bags and N is
the ordinal number within the group. If T is not known, specify
it as "?" (question mark): for example, 1 of 2, 4 of 4, 3 of ?, 89
of 145. This metadata element SHOULD NOT be repeated. If this
metadata element is present, it is RECOMMENDED to also include the
Bag-Group-Identifier element.
Internal-Sender-Identifier: An alternate sender-specific identifier
for the content and/or bag.
Internal-Sender-Description: A sender-local explanation of the
contents and provenance.
In addition to these metadata elements, other arbitrary metadata
elements MAY also be present.
An example of "bag-info.txt" file is as follows:
Source-Organization: FOO University
Organization-Address: 1 Main St., Cupertino, California, 11111
Contact-Name: Jane Doe
Contact-Phone: +1 111-111-1111
Contact-Email: example@example.com
External-Description: Uncompressed greyscale TIFF images from the
FOO papers colle...
Bagging-Date: 2008-01-15
External-Identifier: university_foo_001
Payload-Oxum: 279164409832.1198
Bag-Group-Identifier: university_foo
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/foo
Internal-Sender-Description: Uncompressed greyscale TIFFs created
from microfilm and are...
Kunze, et al. Informational [Page 11]
^L
RFC 8493 BagIt October 2018
2.2.3. Fetch File: fetch.txt
For reasons of efficiency, a bag MAY be sent with a list of files to
be fetched and added to the payload before it can meaningfully be
checked for completeness. The fetch file allows a bag to be
transmitted with "holes" in it, which can be practical for several
reasons. For example, it obviates the need for the sender to stage a
large serialized copy of the content while the bag is transferred to
the receiver. Also, this method allows a sender to construct a bag
from components that are either a subset of logically related
components (e.g., the localized logical object could be much larger
than what is intended for export) or assembled from logically
distributed sources (e.g., the object components for export are not
stored locally under one filesystem tree). An OPTIONAL tag file,
called the fetch file, contains such a list.
The fetch file MUST be named "fetch.txt". Every file listed in the
fetch file MUST be listed in every payload manifest. A fetch file
MUST NOT list any tag files.
Each line of a fetch file MUST be of the form
url length filepath
where _url_ identifies the file to be fetched and MUST be an absolute
URI as defined in [RFC3986], _length_ is the number of octets in the
file (or "-", to leave it unspecified), and _filepath_ identifies the
corresponding payload file, relative to the base directory.
The slash character ('/') MUST be used as a path separator in
_filepath_. One or more linear whitespace characters (spaces or tabs)
MUST separate these three values, and any such characters in the
_url_ MUST be percent-encoded [RFC3986]. If _filename_ includes an
LF, a CR, a CRLF, or a percent sign (%), those characters (and only
those) MUST be percent-encoded as described in [RFC3986]. There is
no limitation on the length of any of the fields in the fetch file.
2.2.4. Other Tag Files
A bag MAY contain other tag files that are not defined by this
document. Implementations MUST perform standard checksum validation
on any tag file that is listed in a tag manifest but MUST otherwise
ignore their contents.
Kunze, et al. Informational [Page 12]
^L
RFC 8493 BagIt October 2018
2.3. Text Tag File Format
All tag files specifically described in this document MUST adhere to
the text tag file format described below. Other tag files MAY adhere
to the text tag file format described below.
Text tag files are line oriented, and each line MUST be terminated by
an LF, a CR, or a CRLF. It is RECOMMENDED that the last line in a
tag file also end with LF, CR, or CRLF. Text tag file names MUST end
in the extension ".txt".
In all text tag files except for the bag declaration file, text MUST
use the character encoding specified in the "bagit.txt" bag
declaration file. Text tag files except for the bag declaration file
MAY include a Byte Order Mark (BOM) only if the specified encoding
requires it for proper decoding. In accordance with [RFC3629], when
"bagit.txt" specifies UTF-8, the tag files MUST NOT begin with a BOM.
See Section 2.1.1.
The use of UTF-8 for text tag files is strongly RECOMMENDED. A
future version of BagIt may disallow encodings other than UTF-8.
2.4. Bag Checksum Algorithms
The payload manifest and tag manifest permit validating the integrity
of the payload and tag files in a bag produced by the checksum
algorithms. Checksum values MUST be encoded so as to conform to the
manifest format specified in Section 2.1.3. However, the internal
details of a checksum are outside the scope of this document.
To avoid future ambiguity, the checksum algorithm SHOULD be
registered in IANA's "Named Information Hash Algorithm Registry"
[ni-registry] according to [RFC6920] but MAY, for backwards
compatibility, also be MD5 [RFC1321] or SHA-1 [RFC3174].
The name of the checksum algorithm MUST be normalized for use in the
manifest's filename by lowercasing the common name of the algorithm
and removing all non-alphanumeric characters. Following is a partial
list that maps common algorithm names to normalized names:
o MD5: md5
o SHA-1: sha1
o sha-256: sha256
o sha-512: sha512
Kunze, et al. Informational [Page 13]
^L
RFC 8493 BagIt October 2018
Starting with BagIt 1.0, bag creation and validation tools MUST
support the SHA-256 and SHA-512 algorithms [RFC6234] and SHOULD
enable SHA-512 by default when creating new bags. For backwards
compatibility, implementers SHOULD support MD5 [RFC1321] and SHA-1
[RFC3174]. Implementers are encouraged to simplify the process of
adding additional manifests using new algorithms to streamline the
process of in-place upgrades.
3. Complete and Valid Bags
A _complete_ bag MUST meet the following requirements:
1. Every required element MUST be present (see Section 2.1).
2. Every file listed in every tag manifest MUST be present.
3. Every file listed in every payload manifest MUST be present.
4. For BagIt 1.0, every payload file MUST be listed in every payload
manifest. Note that older versions of BagIt allowed payload
files to be listed in just one of the manifests.
5. Every element present MUST conform to BagIt 1.0.
A _valid_ bag MUST meet the following requirements:
1. The bag MUST be _complete_.
2. Every checksum in every payload manifest and tag manifest has
been successfully verified against the contents of the
corresponding file.
Kunze, et al. Informational [Page 14]
^L
RFC 8493 BagIt October 2018
4. Examples
4.1. Example of a Basic Bag
This is the layout of a basic bag containing an image and a companion
Optical Character Recognition (OCR) file. Lines of file content are
shown with added parentheses to indicate each complete line. For
brevity, this example uses MD5 rather than the recommended SHA-512.
myfirstbag/
|
| manifest-md5.txt
| (49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png)
| (408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt)
|
| bagit.txt
| (BagIt-version: 1.0 )
| (Tag-File-Character-Encoding: UTF-8 )
|
\--- data/
|
| 27613-h/images/q172.png
| (... image bytes ... )
|
| 27613-h/images/q172.txt
| (... OCR text ... )
....
Kunze, et al. Informational [Page 15]
^L
RFC 8493 BagIt October 2018
4.2. Example Bag Using fetch.txt
This is the layout of a bag that expects the receiver to download the
files listed in the payload manifests prior to validation. Lines of
file content are shown with added parentheses to indicate each
complete line. For brevity, this example uses MD5 rather than the
recommended SHA-512.
highsmith-tahoe/
|
| manifest-md5.txt
| (102b0e6effe208ef9b29864946de9e22 data/23364a.tif )
|
| fetch.txt
| (https://cdn.loc.gov/master/pnp/highsm/23300/23364a.tif
| 216951362 data/23364a.tif )
|
| bagit.txt
| (BagIt-version: 1.0 )
| (Tag-File-Character-Encoding: UTF-8 )
|
| bag-info.txt
| (Internal-Sender-Description: Download link found at )
| ( https://www.loc.gov/resource/highsm.23364/ )
5. Security Considerations
5.1. Special Directory Characters
The paths specified in the payload manifests, tag manifests, and
fetch files do not prohibit special directory characters that have
special meaning on some operating systems. Implementers MUST ensure
that files outside the bag directory structure are not accessed when
reading or writing files based on paths specified in a bag.
All implementations SHOULD have a test suite to guard against special
directory characters.
For example, a maliciously crafted "tagmanifest-sha512.txt" file
might contain entries that begin with a path character such as "/",
"..", or a "~username" home directory reference in an attempt to
cause a naive implementation to leak or overwrite targeted files on a
POSIX operating system.
Windows implementations SHOULD test their implementations to ensure
that safety checks prevent use of drive letters and the less commonly
used namespace sequences (e.g., "\\?\C:\...") described in [MSFNAM].
Kunze, et al. Informational [Page 16]
^L
RFC 8493 BagIt October 2018
To assist implementers, the Library of Congress conformance suite
[LC-CONFORMANCE-SUITE] has some tests for invalid bags that are
expected to fail on POSIX or Windows clients.
5.2. Control of URLs in fetch.txt
Implementers of tools that complete bags by retrieving URLs listed in
a fetch file need to be aware that some of those URLs might point to
hosts, intentionally or unintentionally, that are not under control
of the bag's sender. Moreover, older checksum algorithms, even if
reasonable for detecting corruption during transit, may not offer
strong cryptographic protection against intentional spoofing.
5.3. File Sizes in fetch.txt
The size of files, as optionally reported in the fetch file, cannot
be guaranteed to match the actual file size to be downloaded.
Implementers SHOULD take steps to monitor and abort transfer when the
received file size exceeds the file size reported in the fetch file.
Implementers SHOULD NOT use the file size in the fetch file for
critical resource allocation, such as buffer sizing or storage
requisitioning.
5.4. Attacks on Payload File Content
The integrity assurance provided by manifests is designed to provide
high levels of confidence against data corruption but is not designed
to be secure against active attacks. Organizations that need to
secure bags against such threats SHOULD agree on additional measures,
such as digital signatures, that are out of scope for this
specification.
6. Practical Considerations (Non-normative)
6.1. Interoperability
This section lists practical considerations for implementers and
users. None of the points below are required, but they are
recommended for general-purpose usage.
Upon discovering errors in bags, an implementation is free to take
action (for example, logging or reporting) in an application-specific
manner. This document does not mandate any particular action.
The Library of Congress conformance suite [LC-CONFORMANCE-SUITE] is
provided as a public resource to test new implementations for
compatibility and error handling.
Kunze, et al. Informational [Page 17]
^L
RFC 8493 BagIt October 2018
6.1.1. Filename Normalization
This section provides background information on various challenges
caused by differences in how operating systems, filesystems, and
common tools handle filenames. This section is followed by a list of
recommendations for implementers in Section 6.1.1.3.
6.1.1.1. Case Sensitivity
There are three challenges for interoperability related to filename
case:
o Filesystems such as File Allocation Table (FAT) or Extended File
Allocation Table (EXFAT) always convert filenames to uppercase:
"example.txt" will be stored as "EXAMPLE.TXT".
o Many Unix filesystems save filenames exactly as provided, which
allows multiple files that differ only in case: "example.txt" and
"Example.txt" are separate files.
o New Technology File System (NTFS) and Apple's Hierarchical File
System (HFS) Plus usually preserve case when storing files but are
case insensitive when retrieving them. A file saved as
"Example.txt" will be retrieved by that name but will also be
retrieved as "EXAMPLE.TXT", "example.txt", etc.
6.1.1.2. Unicode Normalization
The Unicode specification has common cases where different character
sequences produce the same human-meaningful text. These are referred
to as "canonically equivalent" and the Unicode specification defines
different normalization forms - see [UNICODE-TR15] for the full
details.
The example below shows the common surname "Nunez" normalized in
different forms.
Normalization Form D (Decomposition)
Char UTF8 Hex Name
----------------------------------------------
N 4e LATIN CAPITAL LETTER N
u 75 LATIN SMALL LETTER U
\u0301 cc81 COMBINING ACUTE ACCENT
n 6e LATIN SMALL LETTER N
\u0303 cc83 COMBINING TILDE
e 65 LATIN SMALL LETTER E
z 7a LATIN SMALL LETTER Z
Kunze, et al. Informational [Page 18]
^L
RFC 8493 BagIt October 2018
Normalization Form C (Canonical Composition)
Char UTF8 Hex Name
----------------------------------------------
N 4e LATIN CAPITAL LETTER N
u c3ba LATIN SMALL LETTER U WITH ACUTE
n c3b1 LATIN SMALL LETTER N WITH TILDE
e 65 LATIN SMALL LETTER E
z 7a LATIN SMALL LETTER Z
Unicode normalization is relevant to BagIt implementors because
different systems have different standards for normalization:
o Apple's HFS Plus filesystem always normalizes filenames to a fully
decomposed form based on the Unicode 2.0 specification (see
[TN1150]).
o Windows treats filenames as opaque character sequences (see
[MSFNAM]) and will store and return the encoded bytes exactly as
provided.
o Linux and other common Unix systems are generally similar to
Windows in storing and returning opaque byte streams, but this
behavior is technically dependent on the filesystem.
o Utilities used for file management, transfer, and archiving may
ignore this issue, apply an arbitrary normalization form, or allow
the user to control how normalization is applied.
In practice, this means that the encoded filename stored in a
manifest may fail a simple file existence check because the
filename's normalization was changed at some point after the manifest
was written. This situation is very confusing for users because the
filenames are visually indistinguishable, and the "missing" file is
obviously present in the payload directory.
6.1.1.3. Recommendations
o Implementations SHOULD discourage the creation of bags containing
files that differ only in case.
o Implementations SHOULD prevent the creation of bags containing
files that differ only in normalization form.
o BagIt implementations SHOULD tolerate differences in normalization
form by comparing both the list of filesystem and manifest names
after applying the same normalization form to both.
Kunze, et al. Informational [Page 19]
^L
RFC 8493 BagIt October 2018
o Implementations SHOULD issue a warning when multiple manifests are
present that differ only in case or normalization form.
6.1.2. Windows and Unix File Naming
As specified above, only the Unix-based path separator ('/') may be
used inside filenames listed in BagIt manifest and fetch.txt files.
When bags are exchanged between Windows and Unix platforms, the path
separator SHOULD be translated as needed. Receivers of bags on
physical media SHOULD be prepared for filesystems created under
either Windows or Unix. Besides the fundamental difference between
path separators ('\' and '/'), generally, Windows filesystems have
more limitations than Unix filesystems.
Windows path names have a maximum of 255 characters, and none of
these characters may be used in a path component:
< > : " / | ? *
Windows also reserves the following names, with or without a file
extension:
CON, PRN, AUX, NUL
COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9
LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9
See [MSFNAM] for more information and possible alternatives.
6.1.3. Legacy Checksum Tools
Some bags have been manually assembled using checksum utilities such
as those contained in the GNU Coreutils package (md5sum, sha1sum,
etc.), collectively referred to here as "md5sum". Implementers who
desire wide support of legacy content should be aware of some known
quirks of these tools.
md5sum can be run in "text mode", which causes it to normalize line
endings on some operating systems. On Unix-like systems, both modes
will usually produce the same results; on systems like Windows, they
can produce different results based on the file contents. The md5sum
output format has two characters between the checksum and the
filepath: the first is always a space, and the second is an asterisk
("*") for binary mode and a space for text mode.
A final note about md5sum-generated manifests is that, for a
_filepath_ containing a backslash ('\'), the manifest line will have
a backslash inserted in front of the _checksum_ and, under Windows,
the backslashes inside _filepath_ can be doubled.
Kunze, et al. Informational [Page 20]
^L
RFC 8493 BagIt October 2018
Implementers MAY wish to accept this format by ignoring a leading
asterisk or handling differences in line termination gracefully but,
if so, implementations MUST warn the user that the bag in question
will fail strict validation. In such cases, it is RECOMMENDED that
tools provide an easy option to update the bag with valid manifests.
7. Augmented Backus-Naur Form (Non-normative)
The Augmented Backus-Naur Form (ABNF) rules provided below are non-
normative. If there is a discrepancy between requirements in the
normative sections and the ABNF, the requirements in the normative
sections prevail. Some definitions use the core rules (e.g., DIGIT,
HEXDIG, etc) as defined in [RFC5234].
7.1. Bag Declaration: bagit.txt
bagit.txt ABNF rules:
bagit-txt = "BagIt-Version: " 1*DIGIT "." 1*DIGIT ending
"Tag-File-Character-Encoding: " encoding ending
encoding = 1*CHAR
ending = CR / LF / CRLF
7.2. Payload Manifest: manifest-algorithm.txt
Payload Manifest ABNF rules:
payload-manifest = 1*payload-manifest-line
payload-manifest-line = checksum 1*WSP filepath ending
checksum = 1*case-hexdig
case-hexdig = DIGIT / "A" / "a" / "B" / "b" / "C" / "c" /
"D" / "d" / "E"/ "e"/ "F" / "f"
filepath = "data/"
1*( unreserved / pct-encoded / sub-delims )
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / DQUOTE / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / "/"
pct-encoded = "%0D" / "%0d" / "%0A" / "%0a" / "%25"
ending = CR / LF / CRLF
Kunze, et al. Informational [Page 21]
^L
RFC 8493 BagIt October 2018
7.3. Bag Metadata: bag-info.txt
bag-info.txt ABNF rules:
metadata = 1*metadata-line
metadata-line = key ":" WSP value ending *(continuation ending)
key = 1*non-reserved
value = 1*non-reserved
continuation = WSP 1*non-reserved
non-reserved = VCHAR / WSP
; any valid character for the specific encoding
; except those that match "ending"
ending = CR / LF / CRLF
7.4. Fetch File: fetch.txt
fetch.txt ABNF rules:
fetch = 1*fetch-line
fetch-line = url 1*WSP length 1*WSP filepath ending
url = <absolute-URI, see [RFC3986], Section 4.3>
length = 1*DIGIT / "-"
filepath = ("data/"
1*( unreserved / pct-encoded / sub-delims ))
ending = CR / LF / CRLF
8. IANA Considerations
This document has no IANA actions.
9. References
9.1. Normative References
[cs-registry]
IANA, "Character Set",
<https://www.iana.org/assignments/character-sets>.
[ni-registry]
IANA, "Named Information Hash Algorithm",
<https://www.iana.org/assignments/named-information>.
[RFC1321] Rivest, R., "The MD5 Message-Digest Algorithm", RFC 1321,
DOI 10.17487/RFC1321, April 1992,
<https://www.rfc-editor.org/info/rfc1321>.
Kunze, et al. Informational [Page 22]
^L
RFC 8493 BagIt October 2018
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119,
DOI 10.17487/RFC2119, March 1997,
<https://www.rfc-editor.org/info/rfc2119>.
[RFC3174] Eastlake 3rd, D. and P. Jones, "US Secure Hash Algorithm 1
(SHA1)", RFC 3174, DOI 10.17487/RFC3174, September 2001,
<https://www.rfc-editor.org/info/rfc3174>.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO
10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
2003, <https://www.rfc-editor.org/info/rfc3629>.
[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifier (URI): Generic Syntax", STD 66,
RFC 3986, DOI 10.17487/RFC3986, January 2005,
<https://www.rfc-editor.org/info/rfc3986>.
[RFC6234] Eastlake 3rd, D. and T. Hansen, "US Secure Hash Algorithms
(SHA and SHA-based HMAC and HKDF)", RFC 6234,
DOI 10.17487/RFC6234, May 2011,
<https://www.rfc-editor.org/info/rfc6234>.
[RFC6920] Farrell, S., Kutscher, D., Dannewitz, C., Ohlman, B.,
Keranen, A., and P. Hallam-Baker, "Naming Things with
Hashes", RFC 6920, DOI 10.17487/RFC6920, April 2013,
<https://www.rfc-editor.org/info/rfc6920>.
[RFC8174] Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
May 2017, <https://www.rfc-editor.org/info/rfc8174>.
9.2. Informative References
[ENCDEP] Tabata, K., Okada, T., Nagamori, M., Sakaguchi, T., and S.
Sugimoto, "A Collaboration Model between Archival Systems
to Enhance the Reliability of Preservation by an Enclose-
and-Deposit Method", 2005,
<https://web.archive.org/web/20060508015635/
http://www.iwaw.net/05/papers/iwaw05-tabata.pdf>.
[LC-CONFORMANCE-SUITE]
The Library of Congress, "Test cases for validating Bagit
Implementations", commit 43bcbdf, November 2017,
<https://github.com/LibraryOfCongress/
bagit-conformance-suite/>.
Kunze, et al. Informational [Page 23]
^L
RFC 8493 BagIt October 2018
[MSFNAM] Microsoft, Inc., "Naming Files, Paths, and Namespaces",
May 2018,
<http://msdn2.microsoft.com/en-us/library/aa365247.aspx>.
[RFC5234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", STD 68, RFC 5234,
DOI 10.17487/RFC5234, January 2008,
<https://www.rfc-editor.org/info/rfc5234>.
[TN1150] Apple Inc., "Technical Note TN1150: HFS Plus Volume
Format", March 2004, <https://developer.apple.com/legacy/
library/technotes/tn/tn1150.html>.
[UNICODE-TR15]
Unicode Consortium, "Unicode Standard Annex #15: Unicode
Normalization Forms", Technical Report, Unicode 11.0.0,
May 2018, <http://www.unicode.org/reports/tr15/>.
Acknowledgements
BagIt benefitted from the thoughtful assistance of Stephen Abrams,
Mike Ashenfelder, Dan Chudnov, Dave Crocker, Scott Fisher, Brad
Hards, Erik Hetzner, Keith Johnson, Leslie Johnston, David Loy, Mark
Phillips, Tracy Seneca, Stian Soiland-Reyes, Brian Tingle, Adam
Turoff, and Jim Tuttle.
Contributors
Additional contributors to the authoring of BagIt are Andy Boyko,
David Brunton, Rosie Storey, Ed Summers, Brian Vargas, and Kate
Zwaard.
Kunze, et al. Informational [Page 24]
^L
RFC 8493 BagIt October 2018
Authors' Addresses
John A. Kunze
California Digital Library
415 20th St, 4th Floor
Oakland, CA 94612
United States of America
Email: jak@ucop.edu
Justin Littman
Stanford Libraries
518 Memorial Way
Stanford, CA 94305
United States of America
Email: justinlittman@stanford.edu
Liz Madden
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
United States of America
Email: emad@loc.gov
John Scancella
Email: john.scancella@gmail.com
Chris Adams
Library of Congress
101 Independence Avenue SE
Washington, DC 20540
United States of America
Email: cadams@loc.gov
Kunze, et al. Informational [Page 25]
^L
|