Archive¶
The LogMan.io Receiver archive is an immutable, column-oriented, append-only data storage for the received raw logs.
Each commlink feeds data into a stream.
A stream is an infinite table with fields.
The stream name is composed of the received. prefix, the name of the tenant and the name of the commlink (e.g. received.mytenant.udp-8889).
The archive stream contains the following fields for each log entry:
raw
: Raw log (string, digitally signed)
row_id
: Primary identifier of the row, unique across all streams (64-bit unsigned integer)
collected_at
: Date and time of the log collection at the collector
received_at
: Date and time of the log receipt at the receiver
source
: Description of the log source (string)
The source field contains:
- for TCP inputs: <ip address> <port> S (S is for a stream)
- for UDP inputs: <ip address> <port> D (D is for a datagram)
- for file inputs: a filename
- for other inputs: optional specification of the source
Example of the source field for a log delivered over UDP:
192.168.100.1 61562 D
The log was collected from IP address 192.168.100.1 and UDP port 61562.
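The source formats listed above can be told apart with a few lines of code. The following is a minimal sketch, not part of the product; the names `Source` and `parse_source` are hypothetical, and file names and free-form descriptions are both reported as "other" because the text does not say how to distinguish them.

```python
from typing import NamedTuple, Optional


class Source(NamedTuple):
    kind: str                # "tcp", "udp" or "other"
    address: Optional[str]   # IP address, for network inputs only
    port: Optional[int]      # source port, for network inputs only
    text: str                # the original field value


def parse_source(source: str) -> Source:
    """Split the `source` field into its documented parts."""
    parts = source.split(" ")
    if len(parts) == 3 and parts[1].isdigit() and parts[2] in ("S", "D"):
        # "S" marks a stream (TCP), "D" marks a datagram (UDP).
        kind = "tcp" if parts[2] == "S" else "udp"
        return Source(kind, parts[0], int(parts[1]), source)
    # A file name or an optional free-form description of the source.
    return Source("other", None, None, source)


print(parse_source("192.168.100.1 61562 D"))
```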
Partition¶
Every stream is divided into partitions. Partitions of the same stream can be located on different receiver instances.
Info
Partitions can share identical periods of time. This means that data entries from the same span of time could be found in more than one partition.
Each partition has a number (part_no), starting from 0.
This number monotonically increases for new partitions in the archive, across streams.
The partition number is globally unique, in terms of the cluster.
The partition number is encoded into the partition name.
The partition name is a six-character name, which starts with aaaaaa (partition #0) and continues with aaaaab (partition #1) and so on.
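The mapping between the partition number and the partition name can be sketched as a fixed-width positional encoding. The text confirms only the first two names, so the alphabet is an assumption: the sketch uses the letters a to p (base 16) solely because the partition names in the practical example on this page contain no letter beyond p. The function names are hypothetical.

```python
import string

# Assumption: 16 letters, "a" to "p"; the source confirms only names #0 and #1.
ASSUMED_ALPHABET = string.ascii_lowercase[:16]


def partition_name(part_no: int, alphabet: str = ASSUMED_ALPHABET, width: int = 6) -> str:
    """Encode a partition number as a fixed-width name, least significant letter last."""
    base = len(alphabet)
    if not 0 <= part_no < base ** width:
        raise ValueError("partition number does not fit into the name")
    letters = []
    for _ in range(width):
        part_no, remainder = divmod(part_no, base)
        letters.append(alphabet[remainder])
    return "".join(reversed(letters))


def partition_number(name: str, alphabet: str = ASSUMED_ALPHABET) -> int:
    """Decode a partition name (with or without the .part suffix) back to its number."""
    if name.endswith(".part"):
        name = name[: -len(".part")]
    number = 0
    for letter in name:
        number = number * len(alphabet) + alphabet.index(letter)
    return number


print(partition_name(0), partition_name(1))
```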
The partition can be inspected in ZooKeeper:
/lmio/receiver/db/received.mytenant.udp-8889/aaaaaa.part
partno: 0 # The partition number, translates to aaaaaa
count: 4307 # Number of rows in this partition
size: 142138 # Size of the partition in bytes (uncompressed)
created_at:
  iso: '2023-07-01T15:22:53.265267'
  unix_ms: 1688224973265267
closed_at:
  iso: '2023-07-01T15:22:53.283168'
  unix_ms: 1688224973283167
extra:
  address: 192.168.100.1 49542 # Address of the collector
  identity: ABCDEF1234567890 # Identity of the collector
  stream: udp-8889
  tenant: mytenant
columns:
  raw:
    type: string
  collected_at:
    summary:
      max:
        iso: '2023-06-29T20:33:18.220173'
        unix_ms: 1688070798220173
      min:
        iso: '2023-06-29T18:25:03.363870'
        unix_ms: 1688063103363870
    type: timestamp
  received_at:
    summary:
      max:
        iso: '2023-06-29T20:33:18.549359'
        unix_ms: 1688070798549359
      min:
        iso: '2023-06-29T18:25:03.433202'
        unix_ms: 1688063103433202
    type: timestamp
  source:
    summary:
      token:
        count: 2
    type: token:rle
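When reading the timestamps from this metadata in code, note that the sample values have 16 digits, so the sketch below assumes that the field named unix_ms carries microseconds since the Unix epoch in UTC. Under that assumption the created_at value converts exactly to its iso counterpart. The helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_unix_ms_field(value: int) -> datetime:
    # Assumption: despite the field name, the value counts microseconds (UTC).
    # Integer arithmetic avoids float rounding of the sub-second part.
    return EPOCH + timedelta(microseconds=value)


created = from_unix_ms_field(1688224973265267)
closed = from_unix_ms_field(1688224973283167)

print(created.isoformat())   # 2023-07-01T15:22:53.265267+00:00
print(closed - created)      # how long the sample partition was open
```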
Tip
Because the partition name is globally unique, it is possible to move partitions from different nodes of the cluster to a shared storage, e.g. a NAS or a cloud storage. The lifecycle is designed so that partition names do not collide; data are not overwritten by different receivers but are reassembled correctly on the "shared" storage.
Lifecycle¶
The partition lifecycle is defined by phases.
Ingest partitions are the partitions that receive the data. Once the ingest is completed, i.e. rotated to a new partition, the former partition is closed. A closed partition cannot be reopened.
When the partition is closed, the partition lifecycle starts. Each phase is configured to point to a specific directory on the filesystem.
The lifecycle is defined at the stream level, in the /lmio/receiver/db/received... entry in ZooKeeper.
Tip
Partitions can also be moved manually into a desired phase by an API call.
Default lifecycle¶
The default lifecycle consists of three phases: hot, warm and cold.
graph LR
I(Ingest) --> H[Hot];
H --1 week--> W[Warm];
W --3 months--> D(Delete);
H --immediately-->C[Cold];
C --18 months--> CD(Delete);
The ingest is done into the hot phase. Once the ingest is completed and the partition is closed, the partition is copied into the cold phase. After a week, the partition is moved to the warm phase. It means that the partition is duplicated - one copy is in the cold phase storage, the second copy is in the warm phase storage.
The partition on the warm phase storage is deleted after 3 months.
The partition on the cold phase storage is compressed using xz/LZMA. The partition is deleted from the cold phase after 18 months.
Default lifecycle definition
define:
  type: jizera/stream
ingest: # (1)
  phase: hot
  rotate_size: 30G
  rotate_time: daily
lifecycle:
  hot:
    - move: # (2)
        age: 1w
        phase: warm
    - copy: # (3)
        phase: cold
  warm:
    - delete: # (4)
        age: 3M
  cold:
    - compress: # (5)
        type: xz
        preset: 6
        threads: 4
    - delete: # (6)
        age: 18M
1. Ingest new logs into the hot phase.
2. After one week, move the partition from the hot phase to the warm phase.
3. Copy the partition into the cold phase immediately after the ingest is closed.
4. Delete the partition from the warm phase after 3 months.
5. Compress the partition immediately on arrival in the cold phase.
6. Delete the partition from the cold phase after 18 months.
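The effect of the default lifecycle on a single partition can be sketched as a function of the partition age. This is an illustration only, under two assumptions that the text does not confirm: the age is counted from the moment the partition is closed, and every rule is applied exactly when its age is reached. A month is counted as 31 days, as described for the age postfixes on this page. The function name is hypothetical.

```python
from datetime import timedelta

WEEK = timedelta(weeks=1)
MONTH = timedelta(days=31)  # the lifecycle counts a month as 31 days


def phases_holding_copy(age: timedelta) -> set:
    """Return the phases that hold a copy of a closed partition of the given age.

    Assumption: `age` is measured from the closing of the partition.
    """
    phases = set()
    if age < WEEK:
        phases.add("hot")        # not yet moved to warm
    elif age < 3 * MONTH:
        phases.add("warm")       # moved after 1w, deleted after 3M
    if age < 18 * MONTH:
        phases.add("cold")       # copied immediately, deleted after 18M
    return phases


for days in (1, 30, 100, 600):
    print(days, sorted(phases_holding_copy(timedelta(days=days))))
```

The output shows the duplication described above: a young partition exists in two phases at once, and only the cold copy remains once the warm copy is deleted.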
Recommended storage tiers for the phases:
- Hot phase should be located on SSDs
- Warm phase should be located on HDDs
- Cold phase is an archive; it can be located on a NAS or slow HDDs.
Note
For more information, visit the Administration manual, chapter about Disk storage.
Lifecycle rules¶
move
: Move the partition at the specified age to the specified phase.
copy
: Copy the partition at the specified age to the specified phase.
delete
: Delete the partition at the specified age.
The age can be e.g. "3h" (three hours), "5M" (five months), "1y" (one year) and so on.
Supported age postfixes:
y
: year, counted as 365 days
M
: month, counted as 31 days
w
: week
d
: day
h
: hour
m
: minute
Note
If age is not specified, the age is set to 0, which means that the lifecycle action is taken immediately.
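The age notation can be converted to a duration as in the following sketch. It is not the product's parser; the name `parse_age` is hypothetical, and the sketch assumes that an age is a single whole number followed by exactly one postfix, since the text shows no other form.

```python
import re
from datetime import timedelta
from typing import Optional

# Seconds per postfix; a year is 365 days and a month is 31 days.
_UNIT_SECONDS = {
    "y": 365 * 86400,
    "M": 31 * 86400,
    "w": 7 * 86400,
    "d": 86400,
    "h": 3600,
    "m": 60,
}


def parse_age(age: Optional[str]) -> timedelta:
    """Convert an age such as "3h" or "18M" to a timedelta."""
    if age is None:
        # A missing age means 0, so the action is taken immediately.
        return timedelta(0)
    match = re.fullmatch(r"(\d+)([yMwdhm])", age)
    if match is None:
        raise ValueError(f"unsupported age: {age!r}")
    count, unit = match.groups()
    return timedelta(seconds=int(count) * _UNIT_SECONDS[unit])


print(parse_age("3h"), parse_age("5M"), parse_age("1y"))
```

Note that the postfixes are case-sensitive: "5M" is five months, while "5m" is five minutes.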
Compression rule¶
compress
: Compress the data on arrival in the phase.
Currently, type: xz is supported with the following options:
preset
: The xz compression preset.
The compression preset levels can be categorised roughly into three categories:
0 ... 2
Fast presets with relatively low memory usage. 1 and 2 should give compression speed and ratios comparable to bzip2 1 and bzip2 9, respectively.
3 ... 5
Good compression ratio with low to medium memory usage. These are significantly slower than levels 0-2.
6 ... 9
Excellent compression with medium to high memory usage. These are also slower than the lower preset levels.
The default is 6.
Unless you want to maximize the compression ratio, you probably don't want a higher preset level than 7 due to speed and memory usage.
threads
: Maximum number of CPU threads used for compression.
The default is 1.
Set to 0 to use as many threads as there are processor cores.
Manual decompression
You can use xz --decompress or unxz from XZ Utils.
You can use 7-Zip to decompress archive files on Windows.
Always work on a copy of the files in the archive; copy all files out of the archive first, and don't modify (decompress) files in the archive.
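The same rule, decompressing into a working copy and leaving the archive untouched, can be followed from a script. Below is a minimal sketch that uses only the Python standard library; the function name `decompress_copy` and the directory layout are hypothetical.

```python
import lzma
import shutil
from pathlib import Path


def decompress_copy(archived_file: Path, work_dir: Path) -> Path:
    """Write a decompressed copy of `archived_file` into `work_dir`.

    The file in the archive is only read, never modified.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    if archived_file.suffix != ".xz":
        # Files stored uncompressed are copied as they are.
        target = work_dir / archived_file.name
        shutil.copyfile(archived_file, target)
        return target
    target = work_dir / archived_file.stem  # drops the ".xz" suffix
    with lzma.open(archived_file, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A call such as `decompress_copy(Path("col-raw.data.xz"), Path("work"))` would produce `work/col-raw.data`.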
Replication rule¶
replica
: Specify the number of data copies (replicas) that should be present in the phase.
Replicas are stored on different receiver instances, so the number of replicas should NOT be greater than the number of receivers in the cluster that operate the given phase. Otherwise, the "excessive" replica will not be created because no available receiver instance is found.
Replication in the hot phase
define:
  type: jizera/stream
lifecycle:
  hot:
    - replica:
        factor: 2
...
factor
: The number of copies of the data in the phase; the default value is 1.
Rotation¶
Partition rotation is a mechanism that closes ingest partitions under specific conditions. When the ingest partition is closed, new data are stored in a newly created ingest partition. This ensures more or less even slicing of the infinite stream of data.
The rotation is configured at the stream level by:
rotate_time
: the period (e.g. daily) for which the partition can be in the ingest mode
rotate_size
: the maximum size of the partition; T, G, M and k postfixes are supported, using base 10.
Both options can be applied simultaneously.
The default stream rotation is daily and 30G.
Roadmap
Only the daily option is available at the moment for rotate_time.
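The two rotation conditions can be sketched as follows. The size parsing follows the base-10 postfixes described above. How the receiver evaluates daily is not stated, so the sketch assumes that a partition is rotated once the calendar day changes; it also assumes that a size without a postfix is a plain byte count. The function names are hypothetical.

```python
from datetime import date

# Base-10 postfixes supported by rotate_size.
_SIZE_FACTORS = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_size(size: str) -> int:
    """Convert a size such as "30G" to a number of bytes."""
    size = size.strip()
    if size and size[-1] in _SIZE_FACTORS:
        return int(size[:-1]) * _SIZE_FACTORS[size[-1]]
    return int(size)  # assumption: no postfix means bytes


def should_rotate(partition_bytes: int, opened_on: date, today: date,
                  rotate_size: str = "30G") -> bool:
    """Both conditions apply at once: whichever is met first closes the partition."""
    too_big = partition_bytes >= parse_size(rotate_size)
    too_old = today > opened_on  # assumption: "daily" means a change of calendar day
    return too_big or too_old


print(parse_size("30G"))
```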
Data vending¶
The data can be extracted from the archive (e.g. for third-party processing, migration and so on) by copying out the data directories of the partitions in scope.
Use ZooKeeper to identify which partitions are in the scope of the vending and where they are physically located on the storage.
The raw column can be directly processed by third-party tools.
When the data are compressed by the lifecycle configuration, decompression may be needed first.
Note
This means that you don't need to move a partition from e.g. the cold phase into the warm or hot phase before vending.
Replay of the data¶
The archived logs can be replayed to subsequent central components.
Non-repudiation¶
The archive is cryptographically secured and designed for traceability and non-repudiation. Digital signatures are used to verify the authenticity and integrity of the data, providing assurance that the logs have not been tampered with and were indeed generated by the stated log source.
This digital signature-based approach to maintaining logs is an essential aspect of secure logging practices and a cornerstone of a robust information security management system. These logs are vital tools for forensic analysis during an incident response, detecting anomalies or malicious activities, auditing, and regulatory compliance.
We use the following cryptographic algorithms to ensure the security of logs: SHA256, ECDSA.
The hash function, SHA256, is applied to each raw log entry. This function takes the input raw log entry and produces a fixed-size string of bytes (32 bytes for SHA256). The output (or hash) is, for practical purposes, unique to the input data; a slight alteration in the input will produce a dramatically different output, a characteristic known as the "avalanche effect".
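The avalanche effect is easy to observe with the standard library. In the sketch below, the two log lines are made-up examples that differ in a single character; the helper `differing_bits` is hypothetical.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equally long byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical raw log entries, differing only in the last character.
original = b"<134>Jul  1 15:22:53 host app: user login ok"
altered = b"<134>Jul  1 15:22:53 host app: user login oK"

digest_original = hashlib.sha256(original).digest()
digest_altered = hashlib.sha256(altered).digest()

print(digest_original.hex())
print(digest_altered.hex())
print(differing_bits(digest_original, digest_altered), "of", len(digest_original) * 8, "bits differ")
```

For a well-behaved hash function, roughly half of the 256 output bits are expected to change.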
This unique hash is then signed using a private signing key through the ECDSA algorithm, which generates a digital signature that is unique to both the data and the key. This digital signature is stored alongside the raw log data, certifying that the log data originated from the specified log source and has not been tampered with during storage.
Digital signatures of raw columns are stored in ZooKeeper (the canonical location) and in the filesystem, under the filename col-raw.sig.
Each partition is also equipped with a unique SSL signing certificate, named signing-cert.der.
This certificate, in conjunction with the digital signature, can be used to verify that col-raw.data (the original raw logs) has not been altered, thus ensuring data integrity.
Important
Please note that the associated private signing key is not stored anywhere but in the process memory for security purposes. The private key is removed as soon as the partition has finished its data ingest.
The signing certificate is issued by an internal Certificate Authority (CA).
The CA's certificate is available in ZooKeeper at /lmio/receiver/ca/cert.der.
Digital signature verification
You can verify the digital signature by using the following OpenSSL commands:
$ openssl x509 -inform der -in signing-cert.der -pubkey -noout > signing-publickey.pem
$ openssl dgst -sha256 -verify signing-publickey.pem -signature col-raw.sig col-raw.data
Verified OK
These commands extract the public key from the certificate (signing-cert.der), and then use that public key to verify the signature (col-raw.sig) against the data file (col-raw.data). If the data file matches the signature, you'll see a Verified OK message.
Additionally, verify the signing-cert.der as well; this certificate has to be issued by the internal CA.
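Both checks can be scripted. The sketch below wraps the two OpenSSL commands shown above and adds one possible way to check the issuer, using openssl verify on PEM conversions of both certificates; that last step is an assumption about how the check could be done, not a documented procedure, and it may need extra options depending on how the internal CA issues certificates. It expects a working copy of the partition with col-raw.data and signing-cert.der already decompressed, plus a copy of the CA certificate exported from ZooKeeper. The function names and the .pem file names are hypothetical.

```python
import subprocess
from pathlib import Path


def ca_check_commands(work_dir: Path, ca_cert_der: Path) -> list:
    """Build the OpenSSL commands that check the issuer of the signing certificate."""
    signing_pem = work_dir / "signing-cert.pem"
    ca_pem = work_dir / "ca-cert.pem"
    return [
        # openssl verify reads PEM, so convert both DER certificates first.
        ["openssl", "x509", "-inform", "der",
         "-in", str(work_dir / "signing-cert.der"), "-out", str(signing_pem)],
        ["openssl", "x509", "-inform", "der",
         "-in", str(ca_cert_der), "-out", str(ca_pem)],
        ["openssl", "verify", "-CAfile", str(ca_pem), str(signing_pem)],
    ]


def verify_partition(work_dir: Path, ca_cert_der: Path) -> None:
    """Raise CalledProcessError if any verification step fails."""
    # Step 1: extract the public key from the signing certificate.
    pubkey = subprocess.run(
        ["openssl", "x509", "-inform", "der",
         "-in", str(work_dir / "signing-cert.der"), "-pubkey", "-noout"],
        check=True, capture_output=True,
    ).stdout
    pubkey_pem = work_dir / "signing-publickey.pem"
    pubkey_pem.write_bytes(pubkey)
    # Step 2: verify the signature of the raw column.
    subprocess.run(
        ["openssl", "dgst", "-sha256", "-verify", str(pubkey_pem),
         "-signature", str(work_dir / "col-raw.sig"), str(work_dir / "col-raw.data")],
        check=True,
    )
    # Step 3: check that the signing certificate was issued by the internal CA.
    for command in ca_check_commands(work_dir, ca_cert_der):
        subprocess.run(command, check=True)
```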
Practical example¶
A practical example of the archive applied to a log stream from Microsoft 365.
The "cold" phase is stored on a NAS mounted to /data/nas, with XZ compression enabled.
Statistics¶
- Date range: 3 months
- Rotation: daily (typically one partition is created per day)
- Total size: 8.3M compressed, compression ratio: 92%
- Total file count: 1062
Content of directories¶
tladmin@lm01:/data/nas/receiver/received.default.o365-01$ ls -l
total 0
drwxr-x--- Jul 25 20:59 aaaebd.part
drwxr-x--- Jul 25 21:02 aaaebe.part
drwxr-x--- Jul 26 21:02 aaaebg.part
drwxr-x--- Jul 27 21:03 aaaeph.part
drwxr-x--- Jul 28 21:03 aaagaf.part
drwxr-x--- Jul 29 21:04 aaagfn.part
drwxr-x--- Jul 30 21:05 aaagjm.part
drwxr-x--- Jul 31 21:05 aaagog.part
drwxr-x--- Aug 1 21:05 aaahik.part
drwxr-x--- Aug 2 21:05 aaahmb.part
drwxr-x--- Aug 3 12:49 aaaifj.part
drwxr-x--- Aug 3 17:50 aaaima.part
drwxr-x--- Aug 3 18:46 aaaiok.part
drwxr-x--- Aug 4 18:46 aaajaf.part
drwxr-x--- Aug 5 18:46 aaajbk.part
drwxr-x--- Aug 6 18:47 aaajcj.part
drwxr-x--- Aug 7 11:33 aaajde.part
drwxr-x--- Aug 7 11:34 aaajeg.part
drwxr-x--- Aug 7 12:22 aaajeh.part
drwxr-x--- Aug 7 13:51 aaajem.part
drwxr-x--- Aug 8 09:50 aaajen.part
drwxr-x--- Aug 8 09:59 aaajfk.part
drwxr-x--- Aug 8 10:06 aaajfo.part
....
drwxr-x--- Oct 25 15:44 aadcne.part
drwxr-x--- Oct 26 06:23 aadcnp.part
drwxr-x--- Oct 26 09:54 aadcof.part
drwxr-x--- Oct 27 09:54 aadcpc.part
tladmin@lm01:/data/nas/receiver/received.default.o365-01/aadcpc.part$ ls -l
total 104
-r-------- 1824 Oct 27 09:54 col-collected_at.data.xz
-r-------- 66892 Oct 27 09:54 col-raw.data.xz
-r-------- 2076 Oct 27 09:54 col-raw.pos.xz
-r-------- 72 Oct 27 09:54 col-raw.sig
-r-------- 1864 Oct 27 09:54 col-received_at.data.xz
-r-------- 32 Oct 27 09:54 col-source-token.data.xz
-r-------- 68 Oct 27 09:54 col-source-token.pos.xz
-r-------- 68 Oct 27 09:54 col-source.data.xz
-r-------- 496 Oct 27 09:54 signing-cert.der.xz
-r-------- 1299 Oct 27 09:54 summary.yaml