seeking tips with reading files

jodunno · March 6, 2023

Hello php Freaks and Freakazoids

I'm still working on a file upload script and i'm at the point of code scanning. I have opened the temp file using fopen and i am using a generator to yield each line of code (save on memory). Imperative is to yield the lines, yet i am trying to accomplish two concepts with one open file. I am trying to check if each line (string) contains php code or javascript code (string contains and foreach loop.) I have no problem with this code. It is working and i am able to catch all of those weak filtering bypass images with php code (i have tested with 12 code injected images.)

What i want to do is scan the jpeg (since it is already opened) and verify the image as having valid jpeg components. So i managed to get the markers of the Huffman table, which helps me stop those weak code injection bypass images since they lack the Huffman table data. However, it seems as though i will need to use if blocks in the lines loop, which is counterproductive (they are evaluated on each loop.) I can easily change the bytes progressively as i verify them with a variable but i will still need to use an if block. Also, i could only think to check for the null bytes in order to stop the inner loop at the position that i am seeking. so for the header, i need the 74 70 73 70 JFIF bytes (decimal values from ord, which is 4A 46 49 46 in hex), then i can verify that the header exists. Now i can jump to FF C4 to get the Huffman table if it is present. et cetera. I have added a passes variable to count the array and the second null byte seems to be at array index 11. I suppose that i could just use the passes variable to cut off the byte scan. However, i would like to know a better way to read bytes x to y only. How could i accomplish the scan of the bytes containing JFIF only and move on?

Is there a better way to code this image scanner? I am not a programmer and this project is the first time that i have used fopen.

here is the code that i am referring to:

$SID_fileLines = function (): Generator {
    $SID_openFile = fopen('image.jpg', 'rb');
    while (!feof($SID_openFile)/*.*/) { yield trim(fgets($SID_openFile)); }
    fclose($SID_openFile); return;
};

foreach ($SID_fileLines() as $SID_currentLine) {
    $SID_pos = strpos($SID_currentLine, "\xFF\xDB", 0);
    $SID_header = []; $nullByte = 0; $passes = 0;
    if ($SID_pos !== false) { /* strpos returns 0 (falsy) when the match is at the start of the line */
        foreach(str_split($SID_currentLine) as $byte) {
            array_push($SID_header, ord($byte));
            $passes += 1; /* passes = 11 (non-zero 12) seems to work as 2nd nullbyte. */
            if (ord($byte) === 0) { $nullByte += 1; }
            if ($nullByte === 2) { unset($byte); break; }
        }
    }
}

I hate to have an if block but i have no idea how i can scan the image for code injection and check the metadata at the same time. I don't want to open the image multiple times and i only want to yield each line to spare memory. Any tips?

kicken · March 6, 2023

4 hours ago, jodunno said:

Is there a better way to code this image scanner?

If your goal is to strip potentially harmful comments / metadata from an image, the way to do that would be to use an image library to re-generate the image file without that data. The image magick extension has a function for this. I'm not sure if loading then re-saving an image with GD will accomplish this or not.

Trying to just arbitrarily manipulate an image file is a poor approach. Even if it works with your test images, it may not work with all images. You'd need to have a good understanding of the file format so you can parse it and manipulate it properly, which is a lot of work when you can just use an existing library instead.

4 hours ago, jodunno said:

i only want to yield each line to spare memory.

You don't have to use generators and yield to save memory, just read the file a bit at a time. A simple loop like this:

$fp = fopen('file', 'rb');
while (!feof($fp)){
    $line = fgets($fp);
    //do stuff
}
fclose($fp);

will also only use enough memory to hold a line's worth of file data at a time without the complication of a generator.

Often, parsing a binary format file is not something you'd do line-by-line anyway. You'd read various chunks based on the file format, possibly seeking to a particular position in the file first.
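As a minimal sketch of the "seek, then read a chunk" idea, and of the "read bytes x to y only" part of the question: the helper name readBytes is hypothetical, and the demo uses an in-memory stream holding the first 11 bytes of a typical JFIF file rather than a real upload.

```php
<?php
// Hypothetical helper: read exactly $length bytes starting at byte $offset.
// Returns null if the seek fails or the file is too short.
function readBytes($fp, int $offset, int $length): ?string
{
    if (fseek($fp, $offset, SEEK_SET) !== 0) {
        return null; // could not move to the requested position
    }
    $bytes = fread($fp, $length);
    // fread() returns fewer bytes than requested near the end of the file
    return ($bytes !== false && strlen($bytes) === $length) ? $bytes : null;
}

// Demo data: SOI, APP0 marker, 2-byte length, then the "JFIF\0" identifier.
$fp = fopen('php://memory', 'w+b');
fwrite($fp, "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00");

$identifier = readBytes($fp, 6, 5); // bytes 6 through 10 only
fclose($fp);

echo bin2hex($identifier), PHP_EOL;
```

With a real file the same call works on the handle returned by fopen($path, 'rb'); nothing outside the requested range is read into the returned string.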

jodunno · March 6, 2023

1 hour ago, kicken said:
You don't have to use generators and yield to save memory, just read the file a bit at a time. A simple loop like this:
$fp = fopen('file', 'rb');
while (!feof($fp)){
    $line = fgets($fp);
    //do stuff
}
fclose($fp);

Thanks for the tip. I appreciate it very much. I assumed that this code read every line into memory (without checking memory myself). Thus, i aimed for a generator method. I'll go back to the method that you have specified. Thanks, Kicken.

I have to disagree that the libraries mentioned are the best way. Simply searching Google for these libraries plus hacks yields a ton of security vulnerabilities, particularly gd. But, honestly, I don't find JPEG format very complicated at all. The markers are the same in every image. The header always begins at the SOI marker xFF xD8 followed by xFF xE0. Open jpeg images in Notepad++ and you can see it yourself. The only change is from JFIF to EXIF. Meantime, the Huffman tables have been documented even in php.

smashingmagazine.com/2019/08/faster-image-loading-embedded-previews/

dev.exiv2.org/projects/exiv2/wiki/The_Metadata_in_JPEG_files#2-The-metadata-structure-in-JPEG

I am just a php hobbyist, so reading the bytes is a bit perplexing. I suppose that i just need to use an offset and increase the position by 2 until i find the second null byte, which should be x00 in hex. I am having trouble with the position moving part, since i lack the experience. I have read a ton of documents about jpeg and even downloaded some cpp examples.

I will keep playing with the code in spare time. Eventually i will find a proper way to traverse the necessary bytes according to the specification. The current code is garbage, so i will start over tomorrow. I will read about getting chunks instead of lines.

Thank you very much for your time and expertise. I hope that you have a pleasant day.

gizmola · March 6, 2023

I'm all for academic exercises for the benefit of learning. I think you will find this page of some help in continuing to explore the jpeg and jfif standards. However if your goal is simply to verify if an image is valid or not, that is problematic, because jfif allows for sections of a file to be ignored, so that special data could be placed there when the file is created. You could look at exif as essentially being this type of extension, so using the exif check functions is valuable in combination with other techniques. Exif data doesn't have to be there, but if you decide that you will only accept images that also have exif, then that is another valuable and efficient check, as you can use an exif checking function to exclude images that don't have valid exif data.

In general, the proven method of knocking down malicious images is to use a combination of getimagesize and imagecreatefromstring, or the imagemagick routines kicken referenced. You use getimagesize to knock down files you have already decided are too large, and then recreate the image from file data. Either of these failing should cause rejection.

Trying to go through the files and decipher them is most certainly a block operation where you would want to read the binary values, looking for the segments, and have routines that can decipher those individual segments. A simple loop is not going to be maintainable in my opinion.

If I was trying to do this, I'd also want to try and see what gd and/or imagemagick source is doing, as those are both open source libraries written in c/c++.

For example, imagemagick has a component used to identify the internals of an image. It's available in their command line tool that allows analysis and modification of an image.

The source is here.

A very large and complicated bit of code it seems.

gizmola · March 6, 2023

Also, this thread is probably interesting to consider....

jodunno · March 6, 2023

Dear gizmola,

Thank you for taking time to post. I appreciate you and your expertise. I really enjoy that link to the github corkami formats page. This page summarizes a lot of data that i have read over the past three weeks and even adds some data that i find most useful. Fantastic!

I have a spaghetti code handler that i've been working on and i'm almost finished. I would like to add my own file scanning code since i already have the file open. I use both filesize and getimagesize directly on the temp file (via try catch because a corrupted file causes an exception on both functions.) The getimagesize has knocked down several bypass files but 12 out of 12 properly injected files pass through the check. I have two old executables from early 2000s named 'strings' and 'binary text scan'. I opened all 12 code injected images in these programs and the code is quickly spotted. I thought, "how could i do this in php?" i searched for opening image files and came to fopen. A simple read into lines allows me to foreach loop over an array of php code then use if str_contains to see if the line has code. Sure enough, all 12 files no longer pass through my upload script. I suppose that it may not be foolproof but at least i am able to catch weak injections. Meantime, i have a copy of phpbb3 and i uploaded these 12 images into my xampp installed forum. All 12 images pass into the forum. I think that phpbb also uses gd and imagemagick.

Anyway, i figure that as long as i have a jpeg open, then i may as well learn how to scan it myself. I will continue reading and see if i can conjure up a tighter scanning segment. Then i will post it for you to test. I like my script but this scanning segment needs some work and i need to follow the specification precisely in order to produce a better scanning segment.

I go to bed now. Long day. Best wishes to you and thanks again!

gizmola · March 6, 2023

Just to be clear the getimagesize is mainly to prevent gigantic files you don't want to waste time rebuilding. You are absolutely correct that it can't be depended upon to detect a file with a hidden payload. That's why you have to rebuild the image from the stringified version of it using imagecreatefromstring. This of course does require enough memory to create the file, so there's no getting around that from a memory use standpoint.
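A rough sketch of the check-then-rebuild approach described above, assuming the GD extension is loaded. The function name reencodeJpeg, the size limits, and the quality value of 90 are all illustrative choices, not values taken from this thread.

```php
<?php
// Hypothetical helper: returns freshly encoded JPEG bytes,
// or null if the upload should be rejected.
function reencodeJpeg(string $path, int $maxWidth = 4000, int $maxHeight = 4000): ?string
{
    // Cheap first check: is it recognised as a JPEG, and how big is it?
    $info = @getimagesize($path);
    if ($info === false || $info[2] !== IMAGETYPE_JPEG) {
        return null; // not recognised as a JPEG
    }
    if ($info[0] > $maxWidth || $info[1] > $maxHeight) {
        return null; // larger than we are willing to rebuild
    }

    // Expensive second check: decode the pixels (needs the whole file in memory).
    $image = @imagecreatefromstring(file_get_contents($path));
    if ($image === false) {
        return null; // GD could not decode the image data
    }

    // Write a new JPEG built from the decoded pixels into an output buffer.
    ob_start();
    imagejpeg($image, null, 90);
    return ob_get_clean();
}
```

The returned bytes, rather than the uploaded temp file, would be what gets stored. This is not a guarantee: the rebuilt file is produced from decoded pixels, so it does not help against payloads hidden in the pixel data itself.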

kicken · March 6, 2023

2 hours ago, jodunno said:

I have to disagree that the libraries mentioned are the best way. Simply searching Google for these libraries plus hacks yields a ton of security vulnerabilities, particularly gd.

If you want to DIY something for your own education or just for fun, so be it, but don't try and say libraries are bad because they've had bugs/vulnerabilities. That's just objectively wrong.

Pretty much every code base will have bugs/vulnerabilities, yours included.
A popular library having a serious bug/vulnerability is going to be rare.
Bugs/vulnerabilities found will be quickly fixed.
By that logic, why even use PHP? It's had its fair share of problems over the years, surely you'd be better off just writing your code in C or assembler, right? /s

One of the main benefits of libraries and open source software is that common problems can be solved once, and everyone can benefit from that solution instead of every developer having to come up with their own (likely broken) solution.

1 hour ago, jodunno said:

I have two old executables from early 2000s named 'strings' and 'binary text scan'.

If you're talking about the strings utility, that works by just looking for printable characters in a sequence of a particular length. If you wanted to do that in PHP, you'd just read the file character by character and note if it is printable or not.

strings.php

<?php

$file = $argv[1] ?? null;
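$minStringLength = (int) ($argv[2] ?? 4); // command-line arguments arrive as strings
if (!$file || !is_readable($file)){
    die('File not specified or not readable');
}

$fp = fopen($file, 'rb');
if (!$fp){
    die('Unable to open file');
}

$currentString = '';
while (!feof($fp)){
    $char = fgetc($fp); // false once the end of the file is reached
    if ($char !== false && ctype_print($char)){
        $currentString .= $char;
    } else {
        // Report runs of at least $minStringLength printable characters
        if (strlen($currentString) >= $minStringLength){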
            echo 'Found: ', $currentString, PHP_EOL;
        }
        $currentString = '';
    }
}
fclose($fp);

1 hour ago, jodunno said:

i have a copy of phpbb3 and i uploaded these 12 images into my xampp installed forum. All 12 images pass into the forum.

I don't know what phpBB does, but a lot of software probably doesn't even look for code inside an image. A lot may not do any validation, or may only validate it can be at least parsed as an image file. Having code inside your image files is only really a problem if you're somehow letting that image be run as code. If you just treat it as an image and that's it, then there's no problem. Stripping the code / other unnecessary info from the image is a good security in depth idea, but making sure your server isn't trying to parse your images for PHP code in the first place is better.

jodunno · March 7, 2023

Dear gizmola,

getimagesize is great for detecting those php files saved as jpeg. I use the following code and display a message that the file may be corrupted upon any of the if blocks being true.

$SID_dimensions = (array) getimagesize($_FILES['Upload']['tmp_name']);
if (empty($SID_dimensions) || !is_array($SID_dimensions)/*.*/) { /* show the "file may be corrupted" message */ }
if (empty($SID_dimensions[2]) || !in_array($SID_dimensions[2], [1, 2, 3], true)/*.*/) { /* show the "file may be corrupted" message */ }
if (empty($SID_dimensions['mime']) || !in_array($SID_dimensions['mime'], ['image/jpeg', 'image/png', 'image/gif'], true)/*.*/) { /* show the "file may be corrupted" message */ }
if (empty($SID_dimensions[0]) || empty($SID_dimensions[1])/*.*/) { /* show the "file may be corrupted" message */ }

however, the properly injected images pass right through (for obvious reasons.) PHP lacks any functions for image scanning so thus begins my journey.

Dear Kicken, i am aware of the linux strings utility but i downloaded a strings.exe file from a hacking site many many years ago. It runs on Windows as a console app. I suppose it is a port of the strings utility.

a simple string split dechex ord conversion will reveal all of the necessary markers for a jpeg. Try it yourself using a small image. The array is very large (but you should understand this concept).

foreach(str_split($SID_currentLine) as $byte) {
    array_push($SID_bytes, dechex(ord($byte)));
}

output reveals the data that i am seeking.

print_r($SID_bytes);

Array ( [0] => ff [1] => d8 [2] => ff [3] => e0 [4] => 0 [5] => 10 [6] => 4a [7] => 46 [8] => 49 [9] => 46 [10] => 0 [11] => 1 [12] => 1 [13] => 1 [14] => 0 [15] => 48 [16] => 0
[20] => ff [21] => db
[85] => ff [86] => db
[154] => ff [155] => c0
[173] => ff [174] => c4
[203] => ff [204] => c4
[384] => ff [385] => c4
[414] => ff [415] => c4
[596] => ff [597] => da, [1118] => ff [1119] => 0, [1171] => ff [1172] => 0 etc.
[226976] => ff [226977] => d9 )

SOI 0xFF, 0xD8 header = found/located [[6] => 4a [7] => 46 [8] => 49 [9] => 46 = JFIF]
DQT 0xFF, 0xDB, Define Quantization Table = found/located
SOF0 0xFF, 0xC0, Variable size , Start Of Frame = found/located
DHT 0xFF, 0xC4 , Variable size , Define Huffman Table(s) = found/located
SOS 0xFF, 0xDA , Variable size , Start Of Scan = found/located
et cetera
EOI 0xFF, 0xD9, End Of Image = found/located

it is not rocket science. The results are easy to obtain. You are looking at the jpeg markers. maybe you are mad that i am able to do this and i'm not a programmer or something. I don't need libraries to read a jpeg. php is capable of doing it.

what i have asked has yet to be answered. How am i supposed to get only the bytes that are of use to me pushed into an array? i have tried strpos but i cannot get it to id these bytes and push them to an array with a stopping position. i guess that i have to play with the code to figure it out myself.

Best wishes, John

gizmola · March 7, 2023

I don't want to speak for Kicken, but I didn't interpret his reply as having any animus attached. It's not a personal attack, and I know he doesn't care whether or not you are a professional developer. He's been answering questions here for many years. Also, you are by definition a programmer, because you are programming.

He's just making the case that most libraries have a rigor to them that your code will not. In regards to efficiency, imagemagick and gd are written in c, so they are going to be many orders of magnitude more efficient than php code you might write to open a file and read it byte by byte. They both have literally millions of users using them, and they are part of countless websites, so they have been thoroughly tested, and in many cases, studied by researchers and students looking for bugs and exploits, which are all benefits of open source.

I already expressed concern that a simple loop reading a file byte by byte is going to result in something very messy, because jpeg file structure isn't simple.

The other issue, from my point of view, was also addressed by Kicken, which is that, data hidden in a jpeg file, in places where jpeg allows for data to be added, to his point does not weaponize the image, and is also valid. This is not unlike the way computer viruses work, and why antivirus companies exist. They must constantly identify new viruses, and fingerprint them, and this job is never complete, because virus writers keep changing them and finding new ways to hide them or exploit new vulnerabilities. Going further with this analogy, a big concern with images has been "stegosploits" where the payload is hidden in the actual image data. In this case it's a valid jpeg, so I don't think you will be able to detect any issues with image of those types.

At any rate, I don't want to lose sight of what your actual problem(s) are at present.

You can not have your cake and eat it too
- As you read through the file you can recognize the start of a structure
- You can continue to read until you get to the end of the structure
  - Assuming you have now identified that structure, you can do analysis of it

In all cases, aside from a simple scan to verify the existence of certain byte sequences, you will need to retain the structures in some form, if you intend to do further analysis of them. Preserving them, means that you will have to keep them in memory. I don't see any way around that, and again, I'd expect at very least to have functions or class structure to handle individual structures and do further analysis of them.

I hope this helps, as beyond that, we are much better suited to specific problems than generalized/strategy based ones.

jodunno · March 7, 2023

i have isolated the 16 byte JFIF header including the trailer and also the JFIF EXIF header. I leave the II and MM in the EXIF header.
I have successfully isolated JFIF and EXIF headers from over twenty different test images. I have also tried most of the images found at an exif test images github page.

           foreach ($SID_fileLines() as $SID_currentLine) {

               if ($pos = strpos($SID_currentLine, "\xFF\xE0") || $pos = strpos($SID_currentLine, "\xFF\xE1")) {
                   foreach(str_split($SID_currentLine) as $byte) {
                       if (dechex(ord($byte)) === '2a') { break; }
                       if (ctype_cntrl($byte) || dechex(ord($byte)) === '2c' || utf8_encode($byte) === 'H'/*.*/) {
                           continue;
                       } else {
                           array_push($SID_c0, ord($byte));
                           array_push($SID_cc, dechex(ord($byte)));
                           array_push($SID_cf, utf8_encode($byte));
                       }
                       if (dechex(ord($byte)) === 'db') { break; }
                   }

               if (!empty($SID_cf)) {
                   foreach ($SID_cf as $char) { $SID_bytes .= $char; }
               }
               if (!empty($SID_bytes)) { echo 'header: ' . $SID_bytes . ' : length = ' . strlen($SID_bytes) . '<br><br>'; }

This process has been a pain but i am enjoying the fact that i have accomplished this task without a programming background. The code is amateur but it is working. Microsoft built an empire off of 'working' code, so it doesn't matter to me right now.

I was hoping that someone could offer code examples of a better method. For example, i have no idea where this H comes from. I guess that it has something to do with the bytes of the utf-8 encode process. I have a lot to learn but at least i was able to pull this off. Now to read and figure out a better method.

I have mentioned that i have done a lot of research on this subject including steganography. Fascinating subject and i have examples of steganography (sample images.) Isolating the actual image data is quite easy (i accomplished this feat today with a large array), recognizing the steganography is not so easy.

i am tired today, so i am going to call it a day. Thank you all for the tips and advice.

Best wishes to you, honestly.

kicken · March 7, 2023

9 hours ago, jodunno said:

maybe you are mad that i am able to do this and i'm not a programmer or something.

I don't know where you got that idea from. I don't care what you ultimately decide to do, I'm merely pointing out how dumb and misguided your "libraries are not worth using, they have bugs/vulnerabilities" argument is. If you ever want to get out of hobbyist programming (or just create more advanced projects) you'll have to get over your apparent fear of libraries, or else waste a ton of time re-inventing the wheel so to speak.

25 minutes ago, jodunno said:

I was hoping that someone could offer code examples of a better method.

I don't know much about the JPEG format, but binary files are rarely line-oriented which means your approach is flawed from the start. You shouldn't be trying to parse the file "line-by-line" as there's typically no such thing as a line in a binary file. You need to learn the format, then parse according to that.

If this reference is accurate, then you should be parsing the data by looking for byte sequence of \xFF\x?? where the second byte is something other than \x00. Then depending on what that second byte is, read some other amount of data to get to the next block. I don't really have the time to learn the format and provide a complete example (I'd just use a library and move on to the next problem). Maybe if I'm feeling up to it later tonight I will try something.

31 minutes ago, jodunno said:
if (dechex(ord($byte)) === '2a') { break; }

The dechex(ord()) dance is not really necessary. Character \x2A is *, so if you want to test if $byte is * you can either just test that directly such as:

if ($byte === '*'){ break; }

or if you'd prefer to keep things in hex, then test:

if ($byte === "\x2a"){ break; }

54 minutes ago, jodunno said:

For example, i have no idea where this H comes from.

For starters, 'H' doesn't have anything to do with that byte's meaning. If you look at that reference above, it's probably part of the density value, a 2-byte integer. 0x0048 == 72. You'll want to learn about using unpack to parse out such multi-byte integer values.
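To make the unpack suggestion concrete, here is a small sketch that decodes the 14 data bytes of a typical JFIF APP0 segment (the bytes that follow the 2-byte length). The key names in the format string are labels chosen for this example; the field order follows the JFIF header layout as I understand it.

```php
<?php
// Data bytes of a typical JFIF APP0 segment, after the FF E0 marker and the length.
$data = "JFIF\x00\x01\x01\x01\x00\x48\x00\x48\x00\x00";

// a5 = 5-byte string, C = unsigned byte, n = unsigned 16-bit big-endian integer.
$header = unpack(
    'a5identifier/Cmajor/Cminor/Cunits/nxDensity/nyDensity/CthumbWidth/CthumbHeight',
    $data
);

echo $header['xDensity'], PHP_EOL;      // the bytes 00 48 read as one integer
echo chr($header['xDensity']), PHP_EOL; // the same value printed as a character
```

This prints 72 and then H: the stray "H" in the output is the low byte of each density value being shown as text instead of being read as part of a number.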

jodunno · March 7, 2023

So, Kicken, is this code better to your eyes (but the dang HH is still showing):

<?php
    $SID_fileLines = '';
    $SID_openFile = fopen('photo1200-96pc.jpg', 'rb'); //DSC_0001

    $SID_bytes = ''; 
    $position = 0; $SOI = 0; $signature = []; $dataBytes = 0; $trailer = '';
    while (!feof($SID_openFile)/*.*/) {
        $SID_char1 = fgetc($SID_openFile);
        $SID_char2 = fgetc($SID_openFile);
        $SID_marker = $SID_char1 . $SID_char2;

        if (empty($SOI) && "\xFF\xD8" === $SID_marker) { $SOI = 1; continue; }
        if ("\xFF\xE0" === $SID_marker) {
            array_push($signature, utf8_encode($SID_char1)/*.*/);
            array_push($signature, utf8_encode($SID_char2)/*.*/);
            while ($dataBytes < 16) {
                $SID_header = fgetc($SID_openFile);
                if (ctype_cntrl($SID_header)) { $dataBytes += 1; continue; }
                array_push($signature, utf8_encode($SID_header));
                $dataBytes += 1;
            }
        }
        if ("\xFF\xDB" === $SID_marker) {
            $trailer .= utf8_encode($SID_char1) . utf8_encode($SID_char2);
            break; //just a test to see how i could use fgetc
        }
    }
    fclose($SID_openFile);

    if (!empty($signature)) {
        foreach ($signature as $char) { $SID_bytes .= $char; }
    }
    if (!empty($SID_bytes)) { echo 'header: ' . $SID_bytes . ' : length = ' . strlen($SID_bytes) . '<br><br>'; }
    if (!empty($trailer)) { echo 'trailer: ' . $trailer . '<br><br>'; }
?>

the only thing that i can think of at this time is to store two characters plus a marker for checking. I've never read a file before the last line code project. I think that it rolls smoother than the last one but it is a bit more complex.

i really need to sleep soon. My eyes are burning. Goodnight and Thanks for the tips, John

Strider64 · March 8, 2023

Personally, I just upload images privately to my own website, but I came across a tutorial a long time ago that stated simply doing this would help. Though it would be a memory hog and something a visitor would not appreciate, so I never used it.

    protected function file_contains_php() {
        $contents = file_get_contents($this->file['tmp_name']);
        $position = strpos($contents, '<?php');
        return $position !== false;
    }

There was a member (I forget his name) here a long time ago that taught me a valuable lesson in programming and that is professionals who write security code do it for a living. They test it out, verify before making it public and even then they sometimes get it wrong. So do you think you can? He was referring to me and the script I just wrote. I agreed with him and used a third party script even though it was painful in trashing that script as I spent all night writing it. 🤣

Edited March 8, 2023 by Strider64

kicken · March 8, 2023

3 hours ago, jodunno said:

So, Kicken, is this code better to your eyes (but the dang HH is still showing):

It's better in that you're not doing a line-based approach any more. What I would say is you're getting a little too specific right now. If you look at the linked reference, you'll see:

Quote

A JPEG file is a sequence of Type-Length-Value chunks called segments:

the type is defined by a marker: 2 bytes, FF then a non-zero byte (*).

the length is a big endian on 2 bytes, and covers the size itself. So the whole segment's length is 2 + length (to cover the length of the marker). This also means that any segment is at most 65537 bytes long.

What you can take from that, is your file will essentially be a repeating sequence of "\xFF\x??<marker>\x????<length>\x??...<data>". The two exceptions to worry about right away are the start of image and end of image markers, they don't have a length and data component.

As such, you should start by being able to parse that repeating sequence. Don't worry about parsing what exactly is contained inside the data, just get the individual blocks. In pseudo code that'd be something like:

while (!feof($file)){
   $marker = findNextMarker($file); //Scan the file until you find a 0xFF?? value, return the ??.
   if ($marker !== 0xD8 && $marker !== 0xD9){
       $length = parseLength($file); //Read two bytes, convert them to a integer length value
       $data = fread($file, $length - 2); // -2 because the encoded length includes the two bytes read above.
   }
}
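One possible way to flesh out that pseudo code is sketched below. It is not the script mentioned in this post. The function bodies are assumptions built only from the quoted format description, with two additions of my own: FF FF is treated as padding, and the restart markers (D0 to D7) and marker 01 are treated as having no length, like SOI and EOI. The listSegments name is hypothetical.

```php
<?php
// Scan forward to the next marker: 0xFF followed by a byte other than 0x00 or 0xFF.
// Returns the second byte as an integer, or null at end of file.
function findNextMarker($fp): ?int
{
    $previous = null;
    while (($char = fgetc($fp)) !== false) {
        $byte = ord($char);
        if ($previous === 0xFF && $byte !== 0x00 && $byte !== 0xFF) {
            return $byte;
        }
        $previous = $byte;
    }
    return null;
}

// Read the 2-byte big-endian length that follows most markers.
function parseLength($fp): ?int
{
    $bytes = fread($fp, 2);
    if ($bytes === false || strlen($bytes) !== 2) {
        return null;
    }
    return unpack('n', $bytes)[1];
}

// Collect every segment as ['marker' => int, 'data' => string].
function listSegments($fp): array
{
    $segments = [];
    while (($marker = findNextMarker($fp)) !== null) {
        $standalone = $marker === 0xD8 || $marker === 0xD9 || $marker === 0x01
            || ($marker >= 0xD0 && $marker <= 0xD7);
        $data = '';
        if (!$standalone) {
            $length = parseLength($fp);
            if ($length === null || $length < 2) {
                break; // truncated or malformed segment
            }
            if ($length > 2) {
                $data = (string) fread($fp, $length - 2); // length includes its own 2 bytes
            }
        }
        $segments[] = ['marker' => $marker, 'data' => $data];
    }
    return $segments;
}

// Demo on a hand-made byte sequence: SOI, APP0, a comment segment, EOI.
$fp = fopen('php://memory', 'w+b');
fwrite($fp, "\xFF\xD8" . "\xFF\xE0\x00\x10JFIF\x00\x01\x01\x01\x00\x48\x00\x48\x00\x00"
    . "\xFF\xFE\x00\x05abc" . "\xFF\xD9");
rewind($fp);
$segments = listSegments($fp);
fclose($fp);

foreach ($segments as $segment) {
    printf("FF%02X: %d data bytes\n", $segment['marker'], strlen($segment['data']));
}
```

On a real photo the loop keeps scanning through the compressed image data after the start-of-scan segment, which is slow byte by byte; stopping at marker 0xDA would be a reasonable shortcut if only the headers are of interest.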

7 hours ago, kicken said:

Maybe if I'm feeling up to it later tonight I will try something.

I did this, and have a simple script that does like I said above, just parses out the different blocks and shows a hex dump of their data. I'll share eventually, but I want to see what you come up with after taking the above into consideration first.

jodunno · March 8, 2023

but i want to be specific. I'm trying to target the useful markers. useful being verification of structure and existence. data integrity and analysis is beyond my skills at this point-in-time. The goal is to check that the header is present and valid (a lot of bypass images inject code after the signature and lack the trailing xDB.) I think that you mean it is too specific in that i am not grabbing the entire marker.

I have attempted to build a function but i am not sure if you are using fgetc or not. I decided to use fgetc and simply unpack a word into an array for clarification.

<?php
    $SID_openFile = fopen('photo1200-96pc.jpg', 'rb');
    function findNextMarker($file) { if (fgetc($file) === "\xFF") { return fgetc($file); } return; }
    while (!feof($SID_openFile)/*.*/) {
        $SID_marker = findNextMarker($SID_openFile);
        if (!empty($SID_marker) && $SID_marker !== "\xD8" && $SID_marker !== "\xD9") {
            $word = unpack("H*", fread($SID_openFile, 16));
            echo dechex(ord($SID_marker)) . ' : '; print_r($word); echo '<br>';
        }
    }
    fclose($SID_openFile);
?>

so is the function what you are suggesting?

kicken · March 8, 2023

3 hours ago, jodunno said:

but i want to be specific

That is the end goal, not something you should be jumping straight to. Basic problem solving is to break the problem down into smaller components, so the problem of "How do I verify JPEG and strip unwanted stuff" breaks down into steps

How do I parse a jpeg?
How do I find the unwanted stuff?
How do I remove the unwanted stuff?

Step one, how do you parse the jpeg can be further broken down into its own steps:

How do I find the markers?
How do I extract the data associated with those markers?
How do I parse that marker data (will vary for each marker type)

If you can successfully find each marker type and its associated data, you can easily make a function to parse that marker's data for the details you need. Extending the pseudo code above for example:

while (!feof($file)){
    $marker = findNextMarker($file); //Scan the file until you find a 0xFF?? value, return the ??.
    if ($marker !== 0xD8 && $marker !== 0xD9){
        $length = parseLength($file); //Read two bytes, convert them to a integer length value
        $data = fread($file, $length - 2); // -2 because the encoded length includes the two bytes read above.
        switch ($marker){
            case 0xE0: parseApp0Header($data); break;
            case 0xC0: parseStartOfFrame($data); break;
            case 0xC4: parseHuffmanTable($data); break;
            //... whatever other markers you're interested in.
        }
    }
}
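As an illustration of what one of those per-marker functions might look like, here is a sketch of parseStartOfFrame. The function is only named in the pseudo code, so this body is an assumption; it relies on the baseline frame header layout (1 byte precision, 2 bytes height, 2 bytes width, 1 byte component count, then 3 bytes per component).

```php
<?php
// Hypothetical body for parseStartOfFrame(): decode the fixed part of an
// SOF0 segment and sanity-check the overall size. Returns null if malformed.
function parseStartOfFrame(string $data): ?array
{
    if (strlen($data) < 6) {
        return null; // too short to hold the fixed fields
    }
    // C = unsigned byte, n = unsigned 16-bit big-endian integer
    $frame = unpack('Cprecision/nheight/nwidth/Ccomponents', $data);

    // Each component adds 3 bytes (id, sampling factors, quantization table).
    if (strlen($data) !== 6 + 3 * $frame['components']) {
        return null; // length does not match the declared component count
    }
    return $frame;
}

// Demo data: 8-bit precision, 16 pixels high, 32 pixels wide, 3 components.
$data = "\x08\x00\x10\x00\x20\x03" . "\x01\x22\x00" . "\x02\x11\x01" . "\x03\x11\x01";
$frame = parseStartOfFrame($data);

printf("%d x %d, %d components\n", $frame['width'], $frame['height'], $frame['components']);
```

The example data is 15 bytes, which with the 2-byte length field gives the 17 bytes reported for the C0 marker elsewhere in this thread.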

jodunno · March 8, 2023

3 hours ago, kicken said:

That is the end goal, not something you should be jumping straight to

I want to open a JPEG JFIF photo and find the markers. It is the end goal? that is ridiculous. I have looked at pseudo code that matches your pseudo code and i notice several things:

1. your code checks for any instance of 0xFF, which using my sample photo equates to 84 unnecessary evaluations. The markers need to be defined by the code that is seeking them. For example, i have an array of markers with empty values. If empty marker[] then continue, which leaves me with 11 passes through my marker analysis code. Image data has 256 byte markers followed by a null byte. 0xFF 0x00. Why do you check these bytes?

2. by skipping the soi and eoi, you're not checking file integrity. Your code has no way of telling one if those bytes exist or not.

3. you're dangerously placing unknown marker data into a string. Hopefully you are not planning on using functions that may execute code. I prefer to store the data in an array where i can analyze the single characters of the array before doing anything with them collectively.

4. I'm not stripping any data from the image (removing unwanted stuff). I am checking the file for code injection (excluding steganography of pixels).

5. If you only use a function with an 0xFF fgetc check, all of the return data of the marker will be counted. I have already done this today. I print the return value of the function and all 16 bytes appear with the marker, e.g. E0. all 17 bytes appear with the C0 marker. I don't see why you are not grabbing the data immediately.

It is nice that you have made your own version but i find it too far away from the path for my liking. I already have the data and i only make 11 passes through my code to get the data.

Best wishes, John

kicken · March 8, 2023

4 minutes ago, jodunno said:

1. your code checks for any instance of 0xFF,

It's pseudo-code, it doesn't search for anything specific. Pseudo code is only an outline of the steps, not a full implementation. The function is called 'findNextMarker'. The file format reference defines a marker as: "the type is defined by a marker: 2 bytes, FF then a non-zero byte".

So no, it's not supposed to look for just any 0xFF. It's supposed to look for 0xFF followed by any byte other than 0x00.
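
As a rough sketch of that rule, the function below lists every 0xFF that is followed by a non-zero byte in a buffer that is already in memory. The name findMarkersInBuffer is made up for this example and is not part of any code in this thread. It also steps over runs of 0xFF, on the assumption that such fill bytes may precede a marker. It only demonstrates the byte test: because it scans everything, it can still report false markers inside segment payloads, which is what reading the segment length avoids.

```php
<?php
// Sketch: list every marker (0xFF followed by a non-zero byte) found in a buffer.
// findMarkersInBuffer is a hypothetical name used only for this example.
function findMarkersInBuffer(string $buffer): array {
    $found = [];
    $offset = 0;
    $last = strlen($buffer) - 1;
    while ($offset < $last && ($pos = strpos($buffer, "\xFF", $offset)) !== false) {
        if ($pos >= $last) {
            break; // 0xFF is the final byte, nothing follows it
        }
        $next = ord($buffer[$pos + 1]);
        if ($next === 0x00) {
            $offset = $pos + 2; // 0xFF 0x00 is a stuffed data byte, not a marker
            continue;
        }
        if ($next === 0xFF) {
            $offset = $pos + 1; // assumed fill byte: test again from the next 0xFF
            continue;
        }
        $found[] = ['offset' => $pos, 'marker' => $next];
        $offset = $pos + 2;
    }
    return $found;
}

// Usage: SOI, an APP0 marker, some data containing a stuffed byte, then EOI.
$sample = "\xFF\xD8\xFF\xE0\x00\x04AB\x12\xFF\x00\x34\xFF\xD9";
$markers = findMarkersInBuffer($sample);
foreach ($markers as $m) {
    printf("offset %d: FF%02X\n", $m['offset'], $m['marker']);
}
```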

8 minutes ago, jodunno said:

by skipping the soi and eoi,

It's not skipping SOI or EOI. It's just not parsing them for a data segment because they do not have one. Again, from the file format reference:

Quote

A few types of markers are parameter-less: no length, no value, just a marker:

the magic signature, at offset 0, called Start of Image (SOI): FF D8

the terminator, at the end of the file, called End of Image (EOI): FF D9

9 minutes ago, jodunno said:

you're dangerously placing unknown marker data into a string. Hopefully you are not planning on using functions that may execute code. I prefer to store the data in an array where i can analyze the single characters of the array before doing anything with them collectively.

A string is effectively just an array of characters. There's no difference between the two from a security perspective. PHP code also isn't subject to something like a buffer overflow error leading to arbitrary code execution (unless there's some problem in the PHP engine itself). You can also analyze a string just as easily (or easier) as an array of individual characters.
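
To illustrate, here is a small sketch with made-up sample data. It reads single bytes straight out of a string by index, walks the whole string byte by byte, and uses a search function on it. The variable names are only for this example.

```php
<?php
// Sample bytes only; not taken from a real image.
$data = "JFIF\x00\x01\x02";

// Index a single byte directly, just like an array element.
$first = $data[0]; // "J"

// Walk every byte without converting the string to an array first.
$hex = [];
for ($i = 0, $n = strlen($data); $i < $n; $i++) {
    $hex[] = sprintf('%02X', ord($data[$i]));
}
echo implode(' ', $hex), "\n"; // 4A 46 49 46 00 01 02

// Search functions work on the whole string at once.
$hasPhpTag = strpos($data, '<?php') !== false;

// str_split() gives the array form if it is ever needed.
$asArray = str_split($data);
```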

13 minutes ago, jodunno said:

I'm not stripping any data from the image (removing unwanted stuff)

Then ignore that step, it doesn't change the others.

14 minutes ago, jodunno said:

It is nice that you have made your own version but i find it too far away from the path for my liking.

You haven't even seen my actual code yet, so I'm not sure how you're able to judge it. Since I said I'd share after letting you ponder the advice for a while though, here it is:

<?php

function parseJpeg(string $file) : array{
    $fp = fopen($file, 'rb');
    if (!$fp){
        throw new \RuntimeException('Unable to open file');
    }

    //First two bytes should be \xFF\xD8 (start of image).
    if (fread($fp, 2) !== "\xFF\xD8"){
        throw new \RuntimeException('Invalid image file.');
    }

    $output = [];
    //Find each segment by looking for the marker values \xFF??
    while (!feof($fp) && ($marker = findNextMarker($fp))){
        $blockData = [
            'marker' => $marker
        ];
        //If the marker is not the end of image marker.
        if ($marker !== 0xD9){
            //Parse the segment data for this marker.
            $blockData['segmentData'] = parseSegmentData($fp);

            if ($marker === 0xDA){ //If the marker is a start of scan marker.
                //Parse the image data that follows.
                $blockData['imageData'] = parseImageData($fp);
            } else if ($marker === 0xE0){ //If the marker is the app0 header.
                $blockData['headerData'] = parseApp0Header($blockData['segmentData']);
            }

            $output[] = $blockData;
        } else {
            $output[] = $blockData;
            break;
        }
    }

    fclose($fp);

    return $output;
}

function parseSegmentData($fp) : string{
    //Markers indicate the start of a segment which is composed of <length><data> sections.
    //The length is two bytes and is the length of the entire segment including the two bytes
    //used to define the length.

    //Extract the length from the next two bytes.
    $dataLength = unpack('nlength', fread($fp, 2))['length'];

    //Read the remaining data using that length value.  Subtract 2 because
    //$dataLength includes the two bytes we just read to obtain the length
    $data = fread($fp, $dataLength - 2);

    return $data;
}

function parseImageData($fp) : string{
    $imageData = [];
    //Read data until we find another marker or hit end of file.
    while (!feof($fp)){
        $c = fread($fp, 1);
        //We might have found another marker.
        if ($c === "\xFF"){
            //Save our position in the file
            //If we found a marker, we need to rewind to just before it.
            $pos = ftell($fp);

            //We only found a marker if the next byte is not \x00
            $next = fread($fp, 1);
            if ($next !== "\x00"){
                //Rewind the file to just before the marker we just found and exit the loop.
                fseek($fp, $pos - 1);
                break;
            }
        }
        $imageData[] = $c;
    }

    $imageData = implode('', $imageData);

    return $imageData;
}

function parseApp0Header(string $data) : ?array{
    //C (unsigned byte) rather than c (signed), so values above 127 are not read as negative.
    $unpacked = unpack('Z5id/C2version/Cunits/n2dpi/C2thumb', $data);
    if (!$unpacked){
        return null;
    }

    return [
        'id' => $unpacked['id']
        , 'version' => $unpacked['version1'] . '.' . $unpacked['version2']
        , 'units' => $unpacked['units']
        , 'density' => $unpacked['dpi1'] . 'x' . $unpacked['dpi2']
        , 'thumbnail' => $unpacked['thumb1'] . 'x' . $unpacked['thumb2']
    ];
}

function findNextMarker($fp) : int{
    //Scan the file content for the next \xFF?? marker.
    //This scans one byte at a time which is terrible for
    //performance but easy.  Loading more data into memory
    //and using strpos would be better, but since you like
    //low-memory...
    do {
        $markerIndicator = fread($fp, 1);
        if ($markerIndicator === "\xFF"){
            $marker = fread($fp, 1);
            if ($marker !== "\x00"){
                return ord($marker);
            }
        }
    } while (!feof($fp));

    throw new \RuntimeException('Marker not found');
}
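
For anyone unfamiliar with unpack format strings, here is a small sketch of how one of this shape behaves. The payload is hand-built for the example (it is not taken from a real file), and it uses the unsigned C code for the single-byte fields. Each field is a format code followed by a key name, fields are separated by /, and a repeat count such as C2version produces the keys version1 and version2.

```php
<?php
// A hand-built APP0 payload: "JFIF\0", version 1.2, units 1, density 72x72, no thumbnail.
$segment = "JFIF\x00" . "\x01\x02" . "\x01" . "\x00\x48\x00\x48" . "\x00\x00";

// Z5 = 5-byte NUL-terminated string
// C  = unsigned byte
// n  = unsigned 16-bit integer, big-endian
$fields = unpack('Z5id/C2version/Cunits/n2dpi/C2thumb', $segment);

print_r($fields);
// Keys produced: id, version1, version2, units, dpi1, dpi2, thumb1, thumb2
```

The same n code is what turns the two length bytes of a segment into an integer.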

jodunno · March 8, 2023

24 minutes ago, kicken said:

It's pseudo-code, it doesn't search for anything specific

well, it's a bit more than pseudo-code that i am used to seeing. I always see if var then do something. Anyway, point is obviously destroyed by your code since you check it in the function. I made a similar function today and i had to add the null byte check because my page was filled with null bytes in addition to the markers. I looked at your pseudo code again but i cannot see your function. Honestly, i didn't expect you to write a complete scanner. I thought that you were jotting barebones code. My apologies to you.

I actually uploaded a small photo to peek at the output. I hope it's ok. I edited the photo to a smaller size and saved it at 96%. I had to find one of my nature photos on this laptop because my pc is shut down already. After i found the photo i wanted to upload it but the link to the page was gone. I had to look through my fiddler proxy to find the uri again. LOL

Thank you for the link. My code is similar to yours in certain ways but i am missing several useful aspects. I also do not know the unpack parameters. Actually, i have not used unpack until today. I have no experience with it, so this code is beyond my understanding at this time.

I have to go to bed soon, i have a course in the morning and i need to sleep. I will look over your code only after i finish mine. I do not seek code written for me so i will keep working on my own code until i have a final version. Then we will see how bad it is. I am an honest man, so i promise no more peeking at your code. I have not memorized anything. I just remember a strange unpack parameter but i forget what it was because memorizing it is cheating. I am not a cheater. I will get back to my code tomorrow after my course.

Goodnight and Best wishes, John.

jodunno · March 9, 2023

Hi Kicken,

I worked on my code for 1.5 hours today after my course (I am enrolled in a German language course, since i live in Germany. I am not German. My Wife was born here, so i am here to be with her and live happily ever after.) I collected all of the data in the main loop and i placed it in an array to verify it. My array is 560 something keys. I do not plan to store the data in an array. I plan to use a single array to store a single marker's data for analysis, then release it (a temp array). Our code is clearly different. I only had four hours of sleep last night. I had a migraine when i logged off. I am so tired. I had trouble processing German language today. I am going to relax today. I am too tired to code. I will work on it this weekend, then i will post my code for critique. Then i will look at your code and see how i could make my code better, faster, smarter, stronger ☺️

I will definitely have to see what you are doing with unpack but i will not look at your code until my code is complete. I will, however, read about unpack a bit more online.

Best wishes, John

jodunno · March 10, 2023

Hello Kicken,

I wanted to work on my code today since it is Friday. I can stay awake late tonight (shaBang!) However, i had to rewrite my code because it wasn't reading the jpeg images that come from my camera. Now i know what you mean about being too specific. Anyway, i have rewritten my code and it is reading all of the markers (as it should do so). However, the photos that come from my camera have multiple entries of the same markers. What is that supposed to be? is it possible that it is reading scan data or restart data? i have not dumped the data yet to compare it. I had to research online how to tell where the pointer is at in the file. I found the ftell function, which is pretty cool if you ask me. I'm sure you already know this function as you are a PROgrammer. I am a hobbyist for now, so i was unaware of this function.

Anyway, i used ftell to show the position in the read and the positions are different. How should i handle this?

Maybe a better question is why are there two entries for all of the markers?

i don't want to look at your code. I want to do it myself and get a working script. Then i will look at your code.

Best wishes, John

jodunno · March 10, 2023

well that was stupid! i forgot to check for the 0xD9 EOI marker.

so here is my collect the markers code. I will look at your code after you critique mine. I'm sure it is lackluster compared to your code since i am a hobbyist for now but it works. The code is missing security and failure checks because it is test/in dev code.

<?php
    $SID_filePointer = fopen('Canon_PowerShot_S40.jpg', 'rb'); //DSC_0001
    $SID_JPEGmarkers = (array) ["\xFF\xD8" => 0, "\xFF\xE0" => '', "\xFF\xE1" => '', "\xFF\xDB" => '', "\xFF\xC0" => '', "\xFF\xC4" => '', "\xFF\xDA" => '',  "\xFF\xD9" => 0];

    function checkMarker($fileCharacter) {
        return in_array($fileCharacter, ["\xE0", "\xE1", "\xDB", "\xC0", "\xC4", "\xDA", "\xD9"], true) ? true : false;
    }
    switch (fread($SID_filePointer, 2)) {
        case "\xFF\xD8": // SOI found
            $SID_JPEGmarkers["\xFF\xD8"] = 1;
            while (!feof($SID_filePointer)/*.*/) {
                $SID_marker = fread($SID_filePointer, 1);
                switch ($SID_marker) {
                  case "\xFF":
                      if (checkMarker($nextMarker = fread($SID_filePointer, 1)) === false) { break; }
                      echo dechex(ord($SID_marker)) . dechex(ord($nextMarker)) . ' ' . ftell($SID_filePointer) . '<br><br>';
                      // i figure that i can mark the markers with ftell then collect the data with a seek and dump read for a speedy process
                      if ($nextMarker === "\xD9") { $SID_JPEGmarkers["\xFF\xD9"] = 1; break(2); }
                  break;
                }
            }

        break;
        default:
            echo 'The file is not readable.'; 
    }

    fclose($SID_filePointer);

By the way, my cameras produce EXIF Jpeg images which lack the JFIF signature.

Edited March 10, 2023 by jodunno
code optimization update

kicken · March 10, 2023

52 minutes ago, jodunno said:

Now i know what you mean about being too specific. Anyway, i have rewritten my code and it is reading all of the markers (as it should do so).

Better, though you're still a bit too specific in that you're looking for specific marker sequences instead of the more generic 0xFF0x?? (where ?? is not 00) sequence. You want to find the generic sequence to find all markers. Then you can just ignore the ones you're not interested in.

59 minutes ago, jodunno said:

However, the photos that come from my camera have multiple entries of the same markers.

I don't believe there's anything in the format that says there cannot be multiple. In fact, the example given in the reference above shows multiple huffman table markers (0xFFC4). Since you're not yet scanning properly for the marker and image data (which is fine, baby steps remember) you might be getting some false markers in your output. I had to make a small update to my code since posting it here as it was incorrectly seeing reset markers as the end of image data instead of ignoring them.
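
The updated code isn't shown here, but the idea can be sketched roughly as follows. Assume the scan data is already in a string. Inside it, 0xFF 0x00 is a stuffed data byte and 0xFF 0xD0 through 0xFF 0xD7 are restart markers, so neither should end the image data. The function name findEndOfScan is made up for this sketch.

```php
<?php
// Sketch: return the offset of the first real marker inside scan data, or null.
// 0xFF 0x00 (stuffed byte) and 0xFF 0xD0-0xD7 (restart markers) belong to the
// image data, so the search continues past them.
function findEndOfScan(string $buffer, int $start = 0): ?int {
    $last = strlen($buffer) - 1;
    $offset = $start;
    while ($offset < $last && ($pos = strpos($buffer, "\xFF", $offset)) !== false) {
        if ($pos >= $last) {
            break; // 0xFF is the final byte, nothing follows it
        }
        $next = ord($buffer[$pos + 1]);
        $isStuffed = ($next === 0x00);
        $isRestart = ($next >= 0xD0 && $next <= 0xD7);
        if (!$isStuffed && !$isRestart) {
            return $pos; // a genuine marker, for example EOI (0xFF 0xD9)
        }
        $offset = $pos + 2;
    }
    return null; // ran out of data without finding a marker
}

// Usage: data, a stuffed byte, a restart marker, more data, then EOI at offset 7.
$scan = "\x12\xFF\x00\x34\xFF\xD0\x56\xFF\xD9";
$end = findEndOfScan($scan);
echo $end, "\n"; // 7
```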

1 hour ago, jodunno said:

Anyway, i used ftell to show the position in the read and the positions are different. How should i handle this?

I'm not sure what you mean here. ftell gives you the current offset of the file pointer. In most cases, you don't need that. My code uses it (along with fseek) to reset the file pointer when scanning the image data, but other than that it's not necessary.
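
A small sketch of that save-and-rewind pattern, using an in-memory stream with made-up bytes so it can be run without an image file:

```php
<?php
$fp = fopen('php://memory', 'w+b');
fwrite($fp, "\xAA\xBB\xFF\xD9");
rewind($fp);

fread($fp, 2);         // consume two data bytes
$pos = ftell($fp);     // remember the offset of the byte we are about to read
$peek = fread($fp, 2); // look ahead at the next two bytes
if ($peek === "\xFF\xD9") {
    fseek($fp, $pos);  // rewind so the caller can read the marker itself
}
echo ftell($fp), "\n"; // 2
```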

1 hour ago, jodunno said:

i don't want to look at your code.

You shouldn't feel like looking at example code is somehow cheating. You should look at it, but rather than just copy/paste it, try to determine how it works and then re-implement that logic yourself. Maybe you end up with the same code, maybe not. The key is to get an understanding of the logic and what the code is doing. One of the first things I do when digging into a new library, aside from reading the docs, is to try and find example code I can look at. That's one of the great things about Github: finding and viewing example code.

jodunno · March 10, 2023

57 minutes ago, kicken said:

Better, though you're still a bit too specific in that you're looking for specific marker sequences instead of the more generic 0xFF0x?? (where ?? is no 00) sequence.

I'm sorry but i do not see the difference between 0xFF0x?? and checking it in the array. Do recall that the word specific is in the word specification. The markers are known. We are not implementing an algebraic expression seeking an unknown number. I am getting marker data, so i am not understanding what is the matter. Is the marker data different using my code than the output using your code? Maybe i am tired from a long week. I am missing data that indicates my markers are incorrect.

I also do not know what you mean about ftell. ftell is telling me the position of the marker, therefore we can use that to collect the data, no? I haven't verified the data yet compared to yours. C0 is supposed to hold 17 bytes of data. Then let us examine my thoughts in code (to be placed after the while loop to find the markers and before the fclose statement).

    fseek($SID_filePointer, $ftellPositions[2]);
    $c0 = str_split(fread($SID_filePointer,  ($ftellPositions[3] - 2) - $ftellPositions[2]));
    foreach ($c0 as $key => $data) { echo dechex(ord($data)) . ' | '; }
    echo ' => ' . count($c0);

one fast pass through the characters yields marker positions. A dump of the position (ftell) yields the marker data.

I am done coding for today, my head is in vertigo.
