Validating large amounts of image data
by Arve Nilsen
Image data can be partially validated by comparing the first few bytes of each file against known file signatures. This article discusses an implementation of this, combined with querying a SQL Server database from C#, and exposing a user interface.
The problem
Imagine having a GUI application where you can navigate to certain categories and view images within those categories. Also imagine that there is a non-zero delay between each navigation step and you have to check many images to find the ones that are corrupted. This is obviously time-consuming when performed manually, and it is also something that can be automated. Let's do that! There are three immediate points to resolve: retrieving image data, validating said data, and designing a user interface.
Creating a plan
Retrieving the binary data from the database
The schema exposes a few fields of interest: image name, file extension, and binary data. The image name and file extension are not actually important for the process, but they are a natural handle for reporting the result to the user.
Validation of the image data
One option is to reach for some graphics API, attempt to read the files, and handle exceptions should the file be corrupt.
The application is running on Windows and thus GDI+ (Graphics Device Interface Plus) is available.
Implementation could mean reading the binary data into a stream and passing it to Image.FromStream.
This would also mean reading untrusted binary data, and it would mean reading the whole file.
Is there a better way?
According to a dynamic list of file formats by Wikipedia, a file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file.
Grabbing the magic numbers of the relevant image formats gives birth to a plan. Reading the first few bytes of each file is likely more efficient than reading the whole file. It also means parsing less untrusted binary data. The images are stored in a database, a SQL Server in this case, but that won't affect anything apart from the implementation details.
That gives the following rough plan: Read binary data from the database. Test the first few bytes against the magic numbers of the supported image formats. If one matches, the image data is assumed to be valid. If none match, the image data is assumed to be corrupt. This check is obviously not exhaustive: a file with an intact signature can still be damaged further in.
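The matching step of the plan can be sketched without a database by working on byte arrays directly. The class name `SignaturePlan`, the method `DetectFormat`, and the choice of JPEG, PNG and GIF are assumptions made for illustration. The byte values are the commonly documented signatures for those formats.

```csharp
using System;

public static class SignaturePlan
{
    // Well-known magic numbers. Which formats to support is an assumption here.
    private static readonly (string Name, byte[] Magic)[] Signatures =
    {
        ("JPEG", new byte[] { 0xFF, 0xD8, 0xFF }),
        ("PNG",  new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        ("GIF",  new byte[] { 0x47, 0x49, 0x46, 0x38 }), // ASCII "GIF8"
    };

    // Returns the name of the first format whose signature matches the start
    // of the data, or null when nothing matches.
    public static string? DetectFormat(ReadOnlySpan<byte> data)
    {
        foreach (var (name, magic) in Signatures)
        {
            // StartsWith is false when the data is shorter than the signature.
            if (data.StartsWith(magic))
            {
                return name;
            }
        }
        return null; // assumed corrupt under the plan
    }
}
```

Returning the matched format name rather than a plain boolean is a design choice of this sketch. It would, for example, allow comparing the detected format with the stored file extension.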
The user interface
Using System.CommandLine
The implementation
The database in question is a SQL Server instance, so the trio of Visual Studio, C# and .NET is a productive choice.
Microsoft.Data.SqlClient delivers all needed functionality for this use case.
Here is a minimal example of connecting, querying, and extracting the binary data. Error handling omitted for brevity. Link to full source to be edited in.
using Microsoft.Data.SqlClient;

using (SqlConnection connection = new("connection string"))
{
    using SqlCommand command = new("SQL query", connection);
    connection.Open();
    using (SqlDataReader reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            // Get the name and extension here if needed
            byte[] imageData = (byte[])reader["binary data column name"];
            // Validating the image data here. See next section
        }
    }
    // Results are calculated and presented
}
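The example above materialises the whole column value for every row, although the plan only needs the leading bytes. One possible refinement is to copy just a prefix with `GetBytes`, which is part of the `IDataRecord` interface that `SqlDataReader` implements. The helper below is a sketch: the names `PrefixReader` and `ReadPrefix` are hypothetical, and it assumes the column value is not NULL.

```csharp
using System;
using System.Data;

public static class PrefixReader
{
    // Hypothetical helper: copies at most maxLength leading bytes
    // of the binary column at the given ordinal.
    public static byte[] ReadPrefix(IDataRecord record, int ordinal, int maxLength)
    {
        var buffer = new byte[maxLength];
        // GetBytes returns the number of bytes actually copied, which is
        // smaller than maxLength when the stored value is shorter.
        long read = record.GetBytes(ordinal, 0, buffer, 0, maxLength);
        return buffer.AsSpan(0, (int)read).ToArray();
    }
}
```

Inside the `while (reader.Read())` loop this could be called as `PrefixReader.ReadPrefix(reader, ordinal, 8)`, where 8 is an assumed upper bound on the signature length. To avoid buffering the whole row, the reader would likely need to be opened with `command.ExecuteReader(CommandBehavior.SequentialAccess)`, in which case columns must be read in ordinal order. Whether this pays off is a question for profiling.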
Validating the image data is done in the inner loop of the program and thus motivates planning for performance. Only the most obvious steps are taken at this point and further optimisation must be guided by profiling.
The magic bytes of the supported image formats can be declared as static readonly fields outside the validation method. The binary data is passed as a read-only span of bytes to a method that attempts to match it against each of the supported formats.
// Requires: using System; using System.Collections.Generic;
private static readonly byte[] JpgHeader = { 0xFF, 0xD8, 0xFF };
// Add the rest of the supported formats here
private static readonly List<byte[]> Headers = new()
{
    JpgHeader, // ...
};
private static bool HasKnownImageSignature(ReadOnlySpan<byte> data)
{
    foreach (var header in Headers)
    {
        // A match requires enough data and identical leading bytes
        if (data.Length >= header.Length &&
            data.Slice(0, header.Length).SequenceEqual(header))
        {
            return true;
        }
    }
    return false;
}
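To show how the check might feed the reporting step, here is a sketch that runs over in-memory data instead of database rows. The class `ValidationReport` and the method `FindCorrupt` are hypothetical, and the signature check is repeated in compact form (JPEG only) so that the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public static class ValidationReport
{
    private static readonly byte[] JpgHeader = { 0xFF, 0xD8, 0xFF };

    // Compact stand-in for the signature check, limited to JPEG.
    private static bool HasJpgSignature(ReadOnlySpan<byte> data) =>
        data.StartsWith(JpgHeader);

    // Returns the names of all images whose data fails the signature check.
    public static List<string> FindCorrupt(IEnumerable<(string Name, byte[] Data)> images)
    {
        var corrupt = new List<string>();
        foreach (var (name, data) in images)
        {
            if (!HasJpgSignature(data))
            {
                corrupt.Add(name);
            }
        }
        return corrupt;
    }
}
```

In the real program the name and data pairs would come from the reader loop shown earlier, and the returned list is what gets presented to the user.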