Validating large amounts of image data

by Arve Nilsen

Image data can be given a basic validity check by comparing the first few bytes of the file with known file signatures. This article discusses an implementation of that check, combined with querying a SQL Server database from C# and exposing a user interface.

The problem

Imagine having a GUI application where you can navigate to certain categories and view the images within those categories. Also imagine that there is a non-zero delay between each navigation step, and that you have to check many images to find the ones that are corrupted. This is obviously time-consuming when performed manually, and it is something that could be automated. Let's do that! There are three immediate points to resolve: retrieving image data, validating said data, and designing a user interface.

Creating a plan

Retrieving the binary data from the database

The schema exposes a few fields of interest: image name, file extension, and binary data. The image name and file extension are not actually important for the process, but they are a natural handle for reporting the result to the user.

Validation of the image data

One option is to reach for some graphics API, attempt to read the files, and handle exceptions should the file be corrupt. The application is running on Windows and thus GDI+ (Graphics Device Interface Plus) is available. Implementation could mean reading the binary data into a stream and passing it to Image.FromStream. This would also mean reading untrusted binary data, and it would mean reading the whole file.
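For comparison, the full-decode option could look roughly like the sketch below. It assumes a Windows machine and a reference to the System.Drawing.Common package; the class name GdiPlusCheck and the method name CanLoad are made up for illustration. It only reports whether GDI+ accepts the bytes as an image, which is not a guarantee that every pixel decodes cleanly.

```csharp
using System;
using System.Drawing;
using System.IO;

public static class GdiPlusCheck
{
    // Hypothetical helper: true when GDI+ accepts the bytes as an image,
    // false when it rejects them.
    public static bool CanLoad(byte[] imageData)
    {
        try
        {
            // The whole blob is wrapped in a stream and handed to GDI+.
            using var stream = new MemoryStream(imageData);
            using var image = Image.FromStream(stream);
            return true;
        }
        catch (ArgumentException)
        {
            // Image.FromStream signals an invalid image format with ArgumentException.
            // A null byte array also ends up here (ArgumentNullException).
            return false;
        }
    }
}
```

Note that this still requires the complete binary data in memory for every image, which is the cost the next section tries to avoid.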

Is there a better way?

A file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file.

A dynamic list of file formats by Wikipedia

Grabbing the magic numbers of the relevant image formats gives birth to a plan. Reading the first few bytes of each file is likely more efficient than reading the whole file. It also means less untrusted binary data to parse. The images are stored in a database, a SQL Server in this case, but that won't affect anything apart from the implementation details.

That gives the following rough plan: Read binary data from the database. Test the first few bytes against the magic numbers of the supported image formats. If one matches, the image data is assumed to be valid. In the case of no matches, the image data is assumed to be corrupt. This is obviously not exhaustive: a file with an intact signature can still be damaged further in.

The user interface

Using System.CommandLine

The implementation

The database in question is a SQL Server database, and therefore the trio Visual Studio, C# and .NET is a highly productive solution. Microsoft.Data.SqlClient delivers all needed functionality for this use case.

Here is a minimal example of connecting, querying, and extracting the binary data. Error handling omitted for brevity. Link to full source to be edited in.

            
            // Requires the Microsoft.Data.SqlClient NuGet package
            using Microsoft.Data.SqlClient;

            using (SqlConnection connection = new("connection string"))
            {
                SqlCommand command = new("SQL query", connection);
                connection.Open();

                using (SqlDataReader reader = command.ExecuteReader())
                {
                    // Read() advances one row at a time and returns false after the last row
                    while (reader.Read())
                    {
                        // Get the name and extension here if needed

                        byte[] imageData = (byte[])reader["binary data column name"];

                        // Validating the image data here. See next section
                    }
                }

                // Results are calculated and presented
            }
            
            

Validating the image data is done in the inner loop of the program and thus motivates planning for performance. Only the most obvious steps are taken at this point and further optimisation must be guided by profiling.

The magic bytes of the supported image formats can be declared as static readonly fields outside the validation method. The binary data is passed as a read-only span of bytes to a method that attempts to match the data against each of the supported formats.

            
            private static readonly byte[] JpgHeader = { 0xFF, 0xD8, 0xFF };
            // Add the rest of the supported formats here

            private static readonly List<byte[]> Headers = new()
            {
                JpgHeader, // ...
            };

            private static bool HasKnownImageSignature(ReadOnlySpan<byte> data)
            {
                foreach (var header in Headers)
                {
                    // A match means the data starts with this format's magic bytes
                    if (data.Length >= header.Length &&
                        data.Slice(0, header.Length).SequenceEqual(header))
                    {
                        return true;
                    }
                }

                return false;
            }
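
To see the check in action, here is a small self-contained sketch that can be compiled as a console program. The PNG signature, the class name SignatureDemo, and the sample rows are assumptions added for illustration; the rows stand in for name and binary data pairs that would normally come from the database.

```csharp
using System;
using System.Collections.Generic;

public static class SignatureDemo
{
    private static readonly byte[] JpgHeader = { 0xFF, 0xD8, 0xFF };

    // Assumption for illustration only: PNG may not be among the supported formats.
    private static readonly byte[] PngHeader =
        { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    private static readonly List<byte[]> Headers = new() { JpgHeader, PngHeader };

    public static bool HasKnownImageSignature(ReadOnlySpan<byte> data)
    {
        foreach (var header in Headers)
        {
            // The length guard keeps Slice from throwing on data shorter than the header.
            if (data.Length >= header.Length &&
                data.Slice(0, header.Length).SequenceEqual(header))
            {
                return true;
            }
        }

        return false;
    }

    public static void Main()
    {
        // Hypothetical rows: (image name, first bytes of the stored data).
        var rows = new (string Name, byte[] Data)[]
        {
            ("photo.jpg", new byte[] { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10 }),
            ("logo.png", new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
            ("broken.jpg", new byte[] { 0x00, 0x00, 0x00 }),
            ("empty.png", Array.Empty<byte>()),
        };

        foreach (var (name, data) in rows)
        {
            string verdict = HasKnownImageSignature(data) ? "ok" : "suspect";
            Console.WriteLine($"{name}: {verdict}");
        }
    }
}
```

The first two rows are reported as ok and the last two as suspect. The file extension plays no part in the verdict, so an image stored under the wrong extension still passes as long as it starts with one of the known signatures.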
            
            

Conclusion

Tags: C# SQL Server

The obligatory "how-this-is-built" post.

by Arve Nilsen

Lives on a VPS. Served by NGINX. Written in vim. Stitched together with a shell script.

The VPS.

I have previously used OVH and been satisfied with their offerings. This time I chose to give Hetzner a try. My main criterion was to find an affordable VPS that is physically located in Europe.

I landed on a small x86 instance running Ubuntu LTS. The monthly price is about the same as a bus ticket. In fact, with my current setup, I don't reach the monthly threshold for invoicing and therefore pay every other month.

A simple NGINX configuration.

I installed NGINX and configured it to serve the blog using TLS. I'll flesh out this portion of the post with more details.

The shell as a static site generator.

Content

            
                mv index.html OLDindex.html
                cat head header $(printf '%s\n' articles/* | sort -r) footer >> index.html
            
        

Tags: Infra

What to expect from this place

by Arve Nilsen

A short introduction to my intentions with this place.

Using writing as a way to structure thinking.

Writing for future me.

I forget stuff at the same rate as I learn new stuff. Therefore I remain an idiot!

Writing the stuff down might help with that.