Encryption and Deduplication in Cloud Environments. Mortal Enemies? Part 1 of 2

Hello everyone,

Those who know me know that I am a regular on two subreddits.
On one of them, dedicated to people with compulsive IT-hoarding problems (yes, it is already considered a mental disorder), there is an extremely active debate about whether cloud providers perform (or are even able to perform) content deduplication when the data on a cloud drive is encrypted.

In this discussion, some argue that encryption prevents deduplication: if two or more users store the same file, but each encrypts it with a different key, the resulting data and metadata will differ, which blocks the expected space savings.
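To see the argument in concrete terms, here is a minimal sketch, using openssl instead of rclone's crypt backend and a hypothetical file video.mkv: encrypting the same plaintext with two different passphrases produces two completely unrelated ciphertexts, so a provider that only ever sees the encrypted blobs has nothing to match. (In fact openssl adds a random salt, and rclone's crypt backend uses a random per-file nonce, so even re-uploads with the same key end up different.)

openssl enc -aes-256-cbc -pbkdf2 -pass pass:keyA -in video.mkv -out video.keyA.enc
openssl enc -aes-256-cbc -pbkdf2 -pass pass:keyB -in video.mkv -out video.keyB.enc
md5sum video.mkv video.keyA.enc video.keyB.enc   # three different checksums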

What is deduplication?

In its simplest description, it means reclaiming available space by eliminating duplicate data on a disk, while keeping everything working transparently for the end user.

There are several ways to implement this process (copied from the original article, which can be consulted here):

File-level deduplication

Also commonly referred to as single-instance storage (SIS), file-level data deduplication compares a file to be backed up or archived with those already stored by checking its attributes against an index. If the file is unique, it is stored and the index is updated; if not, only a pointer to the existing file is stored. The result is that only one instance of the file is saved and subsequent copies are replaced with a “stub” that points to the original file.
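As a quick illustration of the single-instance idea described above (and nothing like what any vendor actually ships), the index can be as simple as a hash table keyed by each file's checksum, with duplicates replaced by a link to the first stored copy. The /data path and the use of hard links here are purely illustrative:

#!/usr/bin/env bash
# naive single-instance storage: one md5 per file, duplicates become hard links
declare -A index
while IFS= read -r -d '' f; do
    h=$(md5sum "$f" | cut -d' ' -f1)
    if [[ -n "${index[$h]:-}" ]]; then
        ln -f "${index[$h]}" "$f"    # duplicate: keep only a pointer to the original
    else
        index["$h"]="$f"             # unique file: record it in the index
    fi
done < <(find /data -type f -print0)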

Block-level deduplication

Block-level data deduplication operates on the sub-file level. As its name implies, the file is typically broken down into segments — chunks or blocks — that are examined for redundancy vs. previously stored information.

The most popular approach for determining duplicates is to assign an identifier to a chunk of data, using a hash algorithm, for example, that generates a unique ID or “fingerprint” for that block. The unique ID is then compared with a central index. If the ID exists, then the data segment has been processed and stored before. Therefore, only a pointer to the previously stored data needs to be saved. If the ID is new, then the block is unique. The unique ID is added to the index and the unique chunk is stored.

The size of the chunk to be examined varies from vendor to vendor. Some have fixed block sizes, while others use variable block sizes (and to make it even more confusing, a few allow end users to vary the size of the fixed block). Fixed blocks could be 8 KB or maybe 64 KB — the difference is that the smaller the chunk, the more likely the opportunity to identify it as redundant. This, in turn, means even greater reductions as even less data is stored. The only issue with fixed blocks is that if a file is modified and the deduplication product uses the same fixed blocks from the last inspection, it might not detect redundant segments because as the blocks in the file are changed or moved, they shift downstream from the change, offsetting the rest of the comparisons.

Variable-sized blocks help increase the odds that a common segment will be detected even after a file is modified. This approach finds natural patterns or break points that might occur in a file and then segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The tradeoff? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.
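To make the block-level idea concrete as well, the tiny sketch below splits a file into fixed 8 KiB chunks, hashes each chunk and counts how many distinct fingerprints it contains; somefile.bin and the 8 KiB chunk size are arbitrary choices, not tied to any product:

#!/usr/bin/env bash
# fixed-block fingerprinting sketch: hash every 8 KiB chunk of a file
# and count how many distinct fingerprints it contains
split -b 8K -d somefile.bin /tmp/chunk_
echo "total chunks:  $(ls /tmp/chunk_* | wc -l)"
echo "unique chunks: $(md5sum /tmp/chunk_* | cut -d' ' -f1 | sort -u | wc -l)"
rm -f /tmp/chunk_*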

To settle the question of whether encrypted data can be deduplicated, and which method gives the best ratio of recovered disk space, I decided to install a SuSE 42.3 with BTRFS and use 6 videos copied from my phone, with two of the volumes accessed through rclone fuse mounts in encrypted mode (each with a different encryption key) and the third through rclone with direct, unencrypted access.

So, three qcow2 virtual disks were created, with no specific compression flags:

# qemu-img create -f qcow2 post_simples.qcow2 32G
Formatting 'post_simples.qcow2', fmt=qcow2 size=34359738368 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
# qemu-img create -f qcow2 post_crypt.qcow2 32G
Formatting 'post_crypt.qcow2', fmt=qcow2 size=34359738368 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16
# qemu-img create -f qcow2 newcrypt.qcow2 32G
Formatting 'newcrypt.qcow2', fmt=qcow2 size=34359738368 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16

And they were attached to the VM that will be used for the test:

Disk /dev/vde: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk /dev/vdf: 32 GiB, 34359738368 bytes, 67108864 sectors
Disk /dev/vdg: 32 GiB, 34359738368 bytes, 67108864 sectors
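The attachment itself is not shown above; with libvirt it could have been done roughly like this (the domain name testvm and the image paths are hypothetical, and the options may need adjusting):

# virsh attach-disk testvm /var/lib/libvirt/images/post_simples.qcow2 vde --driver qemu --subdriver qcow2 --persistent
# virsh attach-disk testvm /var/lib/libvirt/images/post_crypt.qcow2 vdf --driver qemu --subdriver qcow2 --persistent
# virsh attach-disk testvm /var/lib/libvirt/images/newcrypt.qcow2 vdg --driver qemu --subdriver qcow2 --persistent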

Finally, each of them was formatted with BTRFS:

linux-032h:~ # mkfs.btrfs /dev/vde
btrfs-progs v4.5.3+20160729
See http://btrfs.wiki.kernel.org for more information.

Label: (null)
UUID: cbe99eeb-5573-4b7a-b584-92389b97b090
Node size: 16384
Sector size: 4096
Filesystem size: 32.00GiB
Block group profiles:
 Data: single 8.00MiB
 Metadata: DUP 1.01GiB
 System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
 ID SIZE PATH
 1 32.00GiB /dev/vde

linux-032h:~ # mkfs.btrfs /dev/vdf
btrfs-progs v4.5.3+20160729
See http://btrfs.wiki.kernel.org for more information.

Label: (null)
UUID: c373256a-bee3-4040-bf60-d3f0e767cb15
Node size: 16384
Sector size: 4096
Filesystem size: 32.00GiB
Block group profiles:
 Data: single 8.00MiB
 Metadata: DUP 1.01GiB
 System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
 ID SIZE PATH
 1 32.00GiB /dev/vdf
linux-032h:~ # mkfs.btrfs /dev/vdg
btrfs-progs v4.5.3+20160729
See http://btrfs.wiki.kernel.org for more information.

Label: (null)
UUID: c3ca9473-8dde-447a-936d-65900c809d73
Node size: 16384
Sector size: 4096
Filesystem size: 32.00GiB
Block group profiles:
 Data: single 8.00MiB
 Metadata: DUP 1.01GiB
 System: DUP 12.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Number of devices: 1
Devices:
 ID SIZE PATH
 1 32.00GiB /dev/vdg

Note: I chose BTRFS because it is one of the few Linux filesystems in the mainline kernel tree (i.e. not an out-of-tree add-on) that natively supports deduplication, and it is the one with the most documentation on the subject.
Note also that XFS supports deduplication too, but only as a preview feature; the code is still somewhat unstable, so we will not be testing it this time.

After installing the server, it had to be adapted into something rclone could use as a crypt remote, including the transport layer.

For that, I installed an NFS server, which will act as the transport:

linux-032h:~ # zypper install nfs-utils
Loading repository data...
Reading installed packages...
'nfs-utils' not found in package names. Trying capabilities.
Resolving package dependencies...

The following NEW package is going to be installed:
 nfs-kernel-server

1 new package to install.
Overall download size: 123.8 KiB. Already cached: 0 B. After the operation, additional 444.6 KiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package nfs-kernel-server-1.3.0-29.3.1.x86_64 (1/1), 123.8 KiB (444.6 KiB unpacked)
Retrieving: nfs-kernel-server-1.3.0-29.3.1.x86_64.rpm ................................................................................................................................................................................................................[done]
Checking for file conflicts: .........................................................................................................................................................................................................................................[done]
(1/1) Installing: nfs-kernel-server-1.3.0-29.3.1.x86_64 ..............................................................................................................................................................................................................[done]

And the BTRFS dedupe tool:

linux-032h:~ # zypper install duperemove
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following NEW package is going to be installed:
 duperemove

1 new package to install.
Overall download size: 71.5 KiB. Already cached: 0 B. After the operation, additional 223.0 KiB will be used.
Continue? [y/n/...? shows all options] (y): y
Retrieving package duperemove-0.10.beta4-7.1.x86_64 (1/1), 71.5 KiB (223.0 KiB unpacked)
Retrieving: duperemove-0.10.beta4-7.1.x86_64.rpm .......................................................................................................................................................................................................................[done]
Checking for file conflicts: ...........................................................................................................................................................................................................................................[done]
(1/1) Installing: duperemove-0.10.beta4-7.1.x86_64 .....................................................................................................................................................................................................................[done]
linux-032h:~ #
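The NFS configuration itself is not shown here; mounting the BTRFS volumes and an /etc/exports along the lines below would produce the export list further down (the rw/no_subtree_check options and the service name are my assumptions):

linux-032h:~ # mkdir -p /mnt/1 /mnt/2 /mnt/3
linux-032h:~ # mount /dev/vde /mnt/1
linux-032h:~ # mount /dev/vdf /mnt/2
linux-032h:~ # mount /dev/vdg /mnt/3
linux-032h:~ # cat /etc/exports
/mnt/1 *(rw,no_subtree_check)
/mnt/2 *(rw,no_subtree_check)
/mnt/3 *(rw,no_subtree_check)
linux-032h:~ # systemctl enable --now nfs-server
linux-032h:~ # exportfs -ra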

And after configuring the NFS service, our test bed is now ready:

linux-032h:~ # df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 2.0G 0 2.0G 0% /dev
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 2.0G 1.7M 2.0G 1% /run
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/mapper/system-root 10G 1.8G 8.3G 18% /
tmpfs 396M 0 396M 0% /run/user/0
/dev/vde 32G 17M 30G 1% /mnt/1
/dev/vdf 32G 17M 30G 1% /mnt/2
/dev/vdg 32G 17M 30G 1% /mnt/3
linux-032h:~ # showmount -e 172.16.4.50
Export list for 172.16.4.50:
/mnt/3 *
/mnt/2 *
/mnt/1 *

On the client system, we mount the three NFS shares:

client:~ # mkdir /mnt/test/1 -p
client:~ # mkdir /mnt/test/2 -p
client:~ # mkdir /mnt/test/3 -p
client:~ # mount -t nfs 172.16.4.50:/mnt/1 /mnt/test/1/
client:~ # mount -t nfs 172.16.4.50:/mnt/2 /mnt/test/2
client:~ # mount -t nfs 172.16.4.50:/mnt/3 /mnt/test/3
client:~ # df

Filesystem 1K-blocks Used Available Use% Mounted on

172.16.4.50:/mnt/1 33554432 16896 31439872 1% /mnt/test/1
172.16.4.50:/mnt/2 33554432 16896 31439872 1% /mnt/test/2
172.16.4.50:/mnt/3 33554432 16896 31439872 1% /mnt/test/3

Next, it was time to configure rclone. All passwords were randomly generated at configuration time.

[DRIVE1]
type = local
nounc = true

[DRIVE2]
type = local
nounc = true

[DRIVE3]
type = local
nounc = true

[DRIVE2CRYPT]
type = crypt
remote = DRIVE2:/mnt/test/2
filename_encryption = standard
password = glJtiB9t64Qy33of8gxEnnxm8ze4z1-9FfuRwtduk2jKiJT_cj-8D-t1-tdyxfEtKJezF004NUVxW00OfbySZ5LpqEX3qcKxA2O0vd5umpwX1gQmSXHG_y0T44nEi15nLSxEWzqqm3KuHw37rWnererrciRFlPmVHprCj7ysnVpOGmJlXTOjFbvP3emgcWggZgimOicsPNouwveGitVNUYYfvB7VSJG3C4-iNe-1i_qnA6OF6twoQIZDVg
password2 = WG8dUWeZhxFgrp2wBr93eYxZeHUmQ6j0eNwLIqqsBiNRy4S1YOe1RGhMImF1o8GTcGb6HYfjb5d26NxhmJF6Ebmo9Sc8Mqxge6nB3SGw0xNXWOOyG-Zjph2ZbMX63hv0PzyUV8fKR9aPb5R_1x7Dzm37B-eHqAy6qSE3YApDYRXJ99-yfDMzTyFMUnNYANMyuoWkpmS9ChbNWlZQ2XawaH_Wi1fV85SQmhss0G-9ZbbT8nq1zOtjk2uLQg

[DRIVE3CRYPT]
type = crypt
remote = DRIVE3:/mnt/test/3
filename_encryption = standard
password = WYWozrWhEX9ddZH9K0B7pzJGVG2I0cXWRBasbxWJwgAVFRwvTb5WIYrN-2Jkg6LRVTCg6LFWPhCNp7iP9YFvGjCT8zfZunDXl9oDaPmdk4YvIoHXkQ7VuxzKoCETCT--HuzJoUel21tQ0LSsTaxPNz0QBz-IHH9eqgqOl52nZKALzsbz-njOGT_BzH9Ws7P0BDh3LNlgP2XFrhIDpq6lCS_zIPrBYeIIUPxjO0O9jFu0pIsGtNqN2Tj9yg
password2 = -B8hMwhZEd-0Vvt9Ad6K2nFMl8apmXYlpHr-_Wq5tcR_zZp0kjMgPuBMTxGY_Nx2JF5soSuIduXIeqN9oWDstq7nqBACBuSvlDei1CqYG3L4dW9vF0PhsGs2QqBQLiSxXusR_eAZ1Dyyl24nstCj3rYilxeb03vDoZ3cXt-nJiKCJPKBe9IIRooQRuDmHLSsoOcSLKYWukfTHSXCqr6PJdx70yz3FAaj6B3Xi7khy5piJkZF94NAcphZuQ
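The mount commands themselves are not shown; the fuse mountpoints below were presumably created with something along these lines (the --allow-other and --daemon flags are my assumption):

client:~ # rclone mount DRIVE2CRYPT: /mnt/arquivos1 --allow-other --daemon
client:~ # rclone mount DRIVE3CRYPT: /mnt/arquivos2 --allow-other --daemon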

With the fuse mountpoints now active via rclone:

DRIVE2CRYPT: 1099511627776 0 1099511627776 0% /mnt/arquivos1
DRIVE3CRYPT: 1099511627776 0 1099511627776 0% /mnt/arquivos2

It was time to test the encryption mechanism:

client:/mnt/arquivos # ls -lah /mnt/test/1
total 16K

client:/mnt/arquivos # ls -lah /mnt/test/2
total 16K

client:/mnt/arquivos # ls -lah /mnt/arquivos/
total 0
client:/mnt/arquivos # ps aux > ola
client:/mnt/arquivos # ls -lah /mnt/arquivos/
total 25K

-rw-r--r-- 1 root root 25K May 8 16:47 ola
client:/mnt/arquivos # ls -lah /mnt/test/2
total 44K

-rw-r--r-- 1 nobody nogroup 25K May 8 16:47 hssnmam44vphqkkm9e5n5ma8nk
client:/mnt/arquivos # ls -lah /mnt/test/1
total 16K

This validates that the MD5 sums of the same file are indeed different depending on whether it is read through the encrypting fuse mount or through its NFS transport:

client:/tmp # md5sum /mnt/arquivos/ola 
d9dc547df39d38da7ec3153a8c220324 /mnt/arquivos/ola

client:/tmp # md5sum /mnt/test/2/hssnmam44vphqkkm9e5n5ma8nk 
343269bef7a5121cec1a52c3ba89f69e /mnt/test/2/hssnmam44vphqkkm9e5n5ma8nk

We have now reached the actual test.
For this, six .mkv files, generated from footage shot on my phone, were copied onto each of the shares:

/dev/vdf 33554432 27605972 3920172 88% /mnt/2
/dev/vde 33554432 27599136 3927232 88% /mnt/1
/dev/vdg 33554432 27605972 3920172 88% /mnt/3
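For context, the copies go through the plain NFS mount for /mnt/1 and through the crypt fuse mounts for /mnt/2 and /mnt/3, roughly like this (~/videos is a hypothetical source directory; the file names match the dedupe output below):

client:~ # cp ~/videos/0?.mkv /mnt/test/1/       # stored as-is on /mnt/1
client:~ # cp ~/videos/0?.mkv /mnt/arquivos1/    # stored encrypted on /mnt/2
client:~ # cp ~/videos/0?.mkv /mnt/arquivos2/    # stored encrypted on /mnt/3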

Finally, we start the deduplication process on the files:

linux-032h:/mnt # duperemove -r -d /mnt/1 && duperemove -r -d /mnt/2 && duperemove -r -d /mnt/3/
Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /mnt/1/01.mkv [1/6] (16.67%)
csum: /mnt/1/02.mkv [2/6] (33.33%)
csum: /mnt/1/03.mkv [3/6] (50.00%)
csum: /mnt/1/04.mkv [4/6] (66.67%)
csum: /mnt/1/05.mkv [5/6] (83.33%)
csum: /mnt/1/06.mkv [6/6] (100.00%)

Hashing completed. Calculating duplicate extents - this may take some time.
[########################################]
Search completed with no errors. 
Simple read and compare of file data found 3 instances of extents that might benefit from deduplication.
Showing 2 identical extents with id 1c1c53c1
Start Length Filename
0 4691113031 "/mnt/1/01.mkv"
0 4691113031 "/mnt/1/04.mkv"
Showing 2 identical extents with id aaff7af2
Start Length Filename
0 4695534664 "/mnt/1/03.mkv"
0 4695534664 "/mnt/1/06.mkv"
Showing 2 identical extents with id a401a81f
Start Length Filename
0 4696147805 "/mnt/1/02.mkv"
0 4696147805 "/mnt/1/05.mkv"
Using 4 threads for dedupe phase
[0x1a621e0] Try to dedupe extents with id 1c1c53c1
[0x1a62190] Try to dedupe extents with id aaff7af2
[0x1a62050] Try to dedupe extents with id a401a81f
[0x1a62190] Dedupe 1 extents (id: aaff7af2) with target: (0, 4695534664), "/mnt/1/03.mkv"
[0x1a621e0] Dedupe 1 extents (id: 1c1c53c1) with target: (0, 4691113031), "/mnt/1/01.mkv"
[0x1a62050] Dedupe 1 extents (id: a401a81f) with target: (0, 4696147805), "/mnt/1/02.mkv"

Kernel processed data (excludes target files): 14082795500
Comparison of extent info shows a net change in shared extents of: 28165591000
Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /mnt/2/p33nl7c3bhbmqoc9hdtf4muhhc [1/6] (16.67%)
csum: /mnt/2/forj0feoenhsbnee21ct1ithes [2/6] (33.33%)
csum: /mnt/2/cm5i5v98jnis6ndo25mq66jug0 [3/6] (50.00%)
csum: /mnt/2/t4qitgrf2dh77neip0no3uddek [4/6] (66.67%)
csum: /mnt/2/snb9du6p5ope53eqs5ot1l63fc [5/6] (83.33%)
csum: /mnt/2/qfh3md862v1or3grti4sdol6v0 [6/6] (100.00%)
Hashing completed. Calculating duplicate extents - this may take some time.
[########################################]
Search completed with no errors. 
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
Using 128K blocks
Using hash: murmur3
Using 4 threads for file hashing phase
csum: /mnt/3//g77cl9ccbkv9pqdt3kqvn87rt0 [1/6] (16.67%)
csum: /mnt/3//01sosn0h5i0167dld4jeosdj7c [2/6] (33.33%)
csum: /mnt/3//cul7i61r098kj7331qgha5h840 [3/6] (50.00%)
csum: /mnt/3//ahfr3h1haooh36ak85khb12npg [4/6] (66.67%)
csum: /mnt/3//sd0u0lek305ncf98kf4jg4qqg8 [5/6] (83.33%)
csum: /mnt/3//lnoq1762gvn9qd6o2ja2p55cqg [6/6] (100.00%)

Hashing completed. Calculating duplicate extents - this may take some time.
[########################################]
Search completed with no errors. 
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe

As the final result, we have:

/dev/vde 32G 14G 17G 44% /mnt/1
/dev/vdf 32G 27G 3.8G 88% /mnt/2
/dev/vdg 32G 27G 3.8G 88% /mnt/3

With exactly the same contents on all three, the two mount points that held encrypted data show similar usage, while the mount point holding the unencrypted data was effectively deduplicated, which greatly increased the available space.

We can therefore conclude that with file-level deduplication applied to encrypted data there is no space saving: much of the space that could otherwise be reclaimed remains wasted.

In the next post I will repeat the tests, this time using block-level deduplication, in order to reach a firmer conclusion on whether encryption really is the enemy of deduplication or not.

If you have any questions or remarks, you know where to find me.

Until then, all the best!
Nuno