Welcome back, my aspiring cyberwarriors!
Data compression makes the world go round! Well, almost. Without data compression our digital world would slow down considerably and even reading this article would be difficult and cumbersome. Nearly everything that is transmitted over the web is compressed to reduce latency and free up bandwidth.
Compression is one of those technologies that we use everyday but few truly understand. In 2023, the NSO Group used a vulnerability in the webp graphic compression to create a zero-click exploit against IPhones and Android phones (NSO is the godfather of cellphone hacking having created the notorious Pegasus malware). Because the developers at NSO Group had an intimate knowldge and understanding data compression enabled them to create this nefarious malware.
In this article we will delve into data compression to give you a basic understanding of this technology and then explore how NSO Group used this technology to generate a zero-click exploit.
What is Data Compression?
Data compression is the process of reducing the size of digital data files without significantly altering their essential information. It works by encoding data using fewer bits than the original representation, which minimizes storage requirements and speeds up file transfers.
How Data Compression Works
Data compression utilizes algorithms to identify and eliminate redundancies or irrelevant information within the data. The process involves two main components:
Compression Algorithm: This encodes the original data into a more compact representation.
Decompression Algorithm: This reconstructs the compressed data back into its original or near-original form when needed
The effectiveness of compression is often measured by the compression ratio, which compares the size of the compressed data to the original data
Types of Data Compression
There are two primary types of data compression:
Lossless Compression: This method preserves all original data and allows for perfect reconstruction. It’s typically used for text, spreadsheets, and executable files where data integrity is crucial. Lossless compression is typically used for:
Executable files
Documents
Spreadsheets
Text files
Critical system files
Common lossless compression formats include ZIP, GIF, PNG, and PDF
Lossy Compression: This technique permanently removes some data deemed redundant or imperceptible. It’s commonly used for multimedia files like images, audio, and video, where some data loss is acceptable for achieving higher compression ratios. Lossy compression is commonly used for:
Images (e.g. JPEG)
Audio (e.g. MP3)
Video (e.g. MPEG, MP4)
While lossy compression results in some data loss, it can achieve substantial file size reductions, which is useful for multimedia content where some quality loss is acceptable
Common Compression Techniques
Several techniques are employed in data compression:
Huffman Coding: Assigns shorter codes to more frequent data symbols and longer codes to less frequent ones. This compression technique is at the heart of the libwebp zero-click attack.
Run-Length Encoding (RLE): Replaces sequences of identical data values with a single value and count
Lempel-Ziv-Welch (LZW): Uses a dictionary-based approach to replace repeated sequences with references
Dictionary Coding: Replaces common data patterns with references to a predefined dictionary
Applications of Data Compression
Data compression finds applications in various fields:
File Compression: Reduces file sizes for easier storage and faster transfers (e.g., ZIP, RAR formats)
Communication: Enhances the capacity of communication channels by using less bandwidth
Cloud Computing: Maximizes cloud storage capacity and speeds up file transfers
Database Management: Improves storage efficiency and query performance in database systems
By employing these techniques and algorithms, data compression significantly reduces storage requirements, speeds up data transmission, and optimizes overall system performance across various digital applications.
The Libwebp Vulnerability
In 2023, a new and severe vulnerability has been found among the Android ecosystem that puts all Android devices, and even Apple iOS devices, at risk. It enables the attacker to send images via SMS and take control of the device with no user interaction! This vulnerability was first identified by Citizen Lab, a research lab based at the University of Toronto and famous for its tracking of the Pegasus malware. The vulnerability was first reported as CVE-2023-41064 but we have since learned that this vulnerability is ubiquitous throughout the Android ecosystem, Google chrome and many other Linux/Unix based systems. In addition, Telegram, the ToR browser, Brave, Gimp, LibreOffice and many other applications are vulnerable. This may be one of the most important vulnerabilities of our era!
This vulnerability involves a library (reusable code) developed by Google over a decade ago to compress images known as libwebp. libwebp was designed to be a more efficient method of compress images than say jpeg or other image processes algorithms. As such, it is used throughout the mobile device world and many browsers
The danger of this vulnerability is that it enables the attacker to install remote code on the device and take control with NO interaction from the user.
How Does the Exploit Work
This exploit creates a buffer overflow in the image decoder enabling the attacker to install their own remote code and control the device. libwebp uses a Huffman tables (developed by David A. Huffman in 1952, is a popular method for lossless data compression. The central principle of Huffman coding is to use shorter binary codes for more frequent elements in the data and longer codes for less frequent elements) for compression and decompression. The compressed image files contain information about the shape of the Huffman tables and those tables are constructed by the decoder. These Huffman tables are constructed in a heap (heap is a memory area what application data is stored). A specially crafted WebP file can create a Huffman tree that overflows the heap and allows the attackers code to run.
To better understand this exploit, let’s take a closer look at Huffman encoding or Huffman compression.
Huffman Encoding
Huffman coding is an efficient data compression algorithm that works by assigning variable-length codes to characters based on their frequency of occurrence. Here’s how the Huffman coding algorithm works:
Frequency Analysis
Calculate the frequency of each unique character in the input data
Sort the characters in ascending order of their frequencies
Tree Construction
Create leaf nodes for each unique character, with their frequencies as weights
Place these nodes in a priority queue, sorted by frequency
While there is more than one node in the queue:
Extract the two nodes with the lowest frequencies
Create a new internal node with these two as children
Set the new node’s frequency as the sum of its children’s frequencies
Insert the new node back into the priority queue
The last remaining node becomes the root of the Huffman tree
Code Assignment
Traverse the tree from the root to each leaf node
Assign ‘0’ for left branches and ‘1’ for right branches
The path from root to leaf becomes the Huffman code for that character
Encoding
Replace each character in the input data with its corresponding Huffman code
Decoding
Read the encoded bits one by one, traversing the Huffman tree
When a leaf node is reached, output the corresponding character
Return to the root and continue until all bits are processed
The Huffman coding algorithm ensures that more frequent characters have shorter codes, while less frequent characters have longer codes. This results in efficient compression, especially for data with non-uniform character distributions.
Summary
Data compression is a key technology that all of us use every day and few of us think about it. It is critical for the smooth functioning of the Internet and nearly all digital technologies but especially those involved in streaming video, audio and graphics files.
Although most of us never think about compression, we use it all the time in mp4, mp3, jpeg, and other files. In the case of the NSO Group’s exploit, their deep knowledge of the Huffman Compression algorithm enabled them to build a zero-day, zero-click for cellphones and other applications that use libwebp. This type of fundamental understanding of key technologies is what we strive for here at Hackers-Arise, rather than the simple understanding of hacking tools and technologies. It is only in this way that you can elevate to the highest level of the cybersecurity food chain.