Wednesday, 19 September 2012

Hiding The Dead Revisited

In a previous POST, I looked at automating the detection of encrypted data.   Assuming that you find such data...then what?   You will need to know the software that opens the cyphertext, the password and possibly a key.   There is some stuff we can do to try and establish these parameters.   Bearing in mind we have ALREADY done a file signature check on our system, we can use that data to help us.

First let's look at some other routines to try and detect encrypted data.   Some programs create cyphertext with a recognisable signature; you simply need to add those signatures to your custom magic file, which you will store under the /etc directory.

Some crypto programs generate the cyphertext file with a consistent file extension.
We can search our file signature database for those with something like this, assuming our database is called listoffiles and saved in the /tmp directory:

awk -F: '{print $1}' /tmp/listoffiles | egrep -i '\.jbc$|\.dcv$|\.pgd$|\.drvspace$|\.asc$|\.bfe$|\.enx$|\.enp$|\.emc$|\.cryptx$|\.kgb$|\.vmdf$|\.xea$|\.fca$|\.fsh$|\.encrypted$|\.axx$|\.xcb$|\.xia$|\.sa5$|\.jp!$|\.cyp$|\.gpg$|\.sdsk$' > /tmp/encryptedfiles.txt

In this example we are asking awk to look at the file path + file name field in our database and return only the files with certain file extensions associated with encryption.

We can also detect EFS encrypted files, no matter what size.   We could use the sleuthkit command "istat" to parse each MFT entry in the file system and search for the "encrypted" flag there.   However, this is going to be very time consuming; a quicker way is to simply look at our file signature database.   If you try to run the file command on an EFS encrypted file, you will get an error message, and that error message will be recorded in your signature database.   This is a result of the unusual permissions that are assigned to EFS encrypted files: you can't run any Linux commands against the file without receiving an error message.   I have not seen the error message for any non-EFS encrypted files, so the presence of this error message is a very strong indicator that the file is EFS encrypted.   We can look for the error message like this:
awk -F: '$2 ~ /ERROR/ {print $1}' /tmp/listoffiles > /tmp/efsfiles.txt
So, we have now run a number of routines to try and identify encrypted data, including entropy testing on unknown file types, signature checking, file extension analysis and testing for the error message associated with EFS encryption.

For EFS encrypted files and encrypted files with known extensions, we can figure out what package was used to create the cyphertext (the file extensions can be looked up online).   But what about our files that have maximum entropy values?
First, we might want to search our file database for executables associated with encryption, we could do something like this:

awk -F: '$1 ~ /\.[Ee][Xx][Ee]$/ {print $0}' $reppath/tmp/listoffiles | egrep -i 'crypt|steg|pgp|gpg|hide|kremlin' | awk -F: '{print $1}' > /tmp/enc-progs.txt

We have used awk to search the file path/name portion of our file database for executable files, sent the resulting lines to egrep to search for strings associated with encryption, then sent those results back to awk to print just the file path/name portion of the results and redirected the output to a file.    Hopefully we will now have a list of executable files associated with encryption.

We can now have a look for any potential encryption keys.   We have all the information we need already, we just need to do a bit more analysis.  Encryption keys (generally!) have two characteristics that we can look for:
1)  They don't have a known file signature, therefore they will be described as simply "data" in our file database.
2)  They have a fixed size, which is a power of two, and will most likely be 256, 512, 1024 or 2048 bits...I emphasise BITS.

So our algorithm will be to analyse only unknown files to establish their size and return only files that are 256, 512, 1024 or 2048 bits.   We can use the "stat" command to establish file size; the output looks like this:

fotd-VPCCA cases # stat photorec.log 
File: `photorec.log'
Size: 1705            Blocks: 8          IO Block: 4096   regular file
Device: 806h/2054d      Inode: 3933803     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-09-19 11:52:48.933412006 +0100
Modify: 2012-09-19 11:52:12.581410471 +0100
Change: 2012-09-19 11:52:12.581410471 +0100

The important thing to remember is that the file size in the output is in BYTES, so we actually need to look for files that are exactly 32, 64, 128 or 256 BYTES in size, as they map to being 256, 512,1024 or 2048 BITS.

So our code would look something like this:

KEYREC () {
        # size in bytes, taken from the "Size:" line of the stat output
        FILESIZE=`stat -- "$1" | awk '/^ *Size:/ {print $2}'`
        if [ "$FILESIZE" = "32" ] || [ "$FILESIZE" = "64" ] || [ "$FILESIZE" = "128" ] || [ "$FILESIZE" = "256" ]
        then
                echo "$1" >> /tmp/enckeys.txt
        fi
}

awk -F: '$2 ~ /^\ data/ {print $1}' /tmp/listoffiles > /tmp/datafiles.txt
cat /tmp/datafiles.txt | while read -r i ; do KEYREC "$i" ; done

The last two lines of code search the description part of our list of files and signatures for unknown files (using the string \ data as the indicator) and send just the file path/name for those results to a file.   That file is then read and every line fed into a function that runs the stat command on each file, isolates the "Size" field of the output and tests whether the size matches our criteria for being consistent with encryption keys.   Now you will get some false positives (only a handful); by looking at those files in a hex viewer you will be able to eliminate those that aren't encryption keys - if they have ascii text in them, then they aren't encryption keys!
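
A variation on the same size test, as a minimal sketch: GNU stat can print just the size in bytes with -c %s, which avoids picking the value out of the multi-line output.   The function name is_key_sized is made up for illustration, the four sizes are simply the ones discussed above, and the -c option is assumed to be available (it is on GNU coreutils, not on every Unix).

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file's size in bytes matches one of
# the key lengths discussed above (256, 512, 1024 or 2048 bits).
is_key_sized () {
    size=$(stat -c %s -- "$1") || return 2   # GNU stat: size in bytes
    case $size in
        32|64|128|256) return 0 ;;
        *)             return 1 ;;
    esac
}

# Usage sketch: read paths one per line and print the key-sized ones.
# while IFS= read -r f ; do is_key_sized "$f" && printf '%s\n' "$f" ; done < /tmp/datafiles.txt
```

The case statement keeps the list of sizes in one place, so adding another key length is a one word change.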

Now we have the cyphertext, the program used to decrypt the data plus the key; all we need now is the password.  We are going to again use the tool every forensicator MUST have, bulk_extractor.   One of the many features of bulk_extractor is to extract all of the ascii strings from a hard drive and de-duplicate them, leaving us with a list of unique ascii strings.   It may well be that the user's crypto password has been cached to the disk - often in the swap file.   We will probably want to do the string extraction at the physical disk level, as opposed to the logical disk.   We will need several gigabytes of space on an external drive as the list of strings is going to be very large.   The command to extract all the strings on the first physical disk and send the results to an external drive mounted at /mnt/usbdisk would be:

bulk_extractor -E wordlist -o /mnt/usbdisk /dev/sda

We need to do a bit more work, as we can't realistically try every ascii string that bulk_extractor generates.  The password is likely to be long, with a mixture of upper/lower case characters + numbers.   You can use a regex to search for strings with those characteristics to narrow down the number of potential passwords (Google is your friend here!).
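
As a rough sketch of that narrowing step, the filter below chains four greps rather than using one clever regex.   The function name, the minimum length of 8 and the word list path in the usage line are all my assumptions; tune them to the password policy you think your suspect follows.

```shell
#!/bin/sh
# Hypothetical filter: keep lines of 8 or more characters that contain at
# least one lower case letter, one upper case letter and one digit.
password_candidates () {
    LC_ALL=C grep -E '^.{8,}$' |
        LC_ALL=C grep '[a-z]' |
        LC_ALL=C grep '[A-Z]' |
        LC_ALL=C grep '[0-9]'
}

# Usage sketch (the word list path is an assumption):
# password_candidates < /mnt/usbdisk/wordlist.txt > /mnt/usbdisk/candidates.txt
```

Chaining simple patterns is easier to read and adjust than a single expression, at the cost of a few extra passes over the list.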

So, detecting cyphertext and encrypted compressed archives, identifying potential crypto keys and potential passwords is doable with a surprisingly small amount of code.    For those on a budget, this solution costs about 10 pence (the cost of a recordable CD) if you want to use the suspect's processing power to do your analysis.


Wednesday, 12 September 2012

Saturday, 8 September 2012

Friday, 7 September 2012

Copying the dead

In a previous POST we looked at doing a file signature check on all the files in the live set, then using awk to search the file descriptor field in the resultant database.
Once we do our awk search we send the results to a text file, so now we need to know how to use that text file to further process our results.

Let's imagine we have searched for the string "image data" in our file description field to identify all our graphic files, and created a text file of the file names and file paths called live_pics.txt in our /tmp folder with this command:

awk -F: '$2 ~ /image\ data/ {print $1}' listoffiles > /tmp/live_pics.txt

One thing you definitely DON'T want to do is use the text file in a loop and the cp (copy command) to copy data out, like this:

cat /tmp/live_pics.txt | while read IMG ; do cp $IMG  /mnt/case/images ; done

Our command reads the live_pics.txt file line by line and copies each file out to a single directory on an external drive mounted at /mnt/case.  There are three reasons we don't do this.  First, if we have two files with the same name in our live_pics list (but in different directories), the cp command will copy out the first file and then overwrite it with the second, because a file with that name already exists in our receiving directory.  Second, if an entry in our list of pictures happens to start with a "-" character, cp will interpret the remainder of the string as options, resulting in an error message.  Third, if there is any white space in the file path or file name, the shell will split the unquoted $IMG variable at that white space and hand cp several bogus paths, so the image will not be copied out.

Here is my solution to the problem: I use a function that checks to see if a file with that name already exists in the receiving directory; if so, it appends [1], [2] etc to the file name.  I had to overcome my fear and loathing of perl to introduce a perl regular expression that adds or increments that counter.  I set the Internal Field Separator variable ($IFS) to a newline, so the shell only splits its input at newline characters (ignoring the white space in any file paths).  I also include a "--" after the -p option to let cp know that we have finished with our options.  Here is the function and a few lines of code to show how you would use it:

dir=/mnt/case/images    # receiving directory on the external drive
IFS='
'
filecp () {
     filepath="$1"
     filename=`basename "$1"`
     while [ -e "$dir/$filename" ]; do
            filename=`echo "$filename" | perl -pe 's/(\[(\d+)\])?(\..*)?$/"[".(1+$2)."]$3"/e;'`
     done
     cp -p -- "$filepath" "$dir/$filename"
}

cat /tmp/live_pics.txt | while read -r IMG ; do filecp "$IMG" ; done
unset IFS

Obviously the same principle applies to any list of files that you want to copy out from the file system, so the code can be integrated into any of your scripts in your previewing system.  If you haven't been using some of the defensive programming techniques in this code when using the cp command, you really need this code!
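
For those who share my fear and loathing of perl, the renaming step can also be sketched in plain shell.   This is a variation under stated assumptions, not the function above: the name unique_name is made up, and it splits the name at the LAST dot (so archive.tar.gz becomes archive.tar[1].gz), whereas the perl expression splits at the first dot.

```shell
#!/bin/sh
# Hypothetical helper: print a file name that does not yet exist in the
# destination directory, adding [1], [2] ... before the extension.
# $1 = destination directory, $2 = desired file name
unique_name () {
    dest=$1
    name=$2
    n=0
    case $name in
        ?*.*) stem=${name%.*} ; ext=.${name##*.} ;;   # has an extension
        *)    stem=$name ; ext= ;;                    # no extension (or a dot file)
    esac
    candidate=$name
    while [ -e "$dest/$candidate" ]; do
        n=$((n + 1))
        candidate="${stem}[${n}]${ext}"
    done
    printf '%s\n' "$candidate"
}

# Usage sketch:
# cp -p -- "$filepath" "$dir/$(unique_name "$dir" "$(basename "$filepath")")"
```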

Understanding the dead

Crimminey!  All this stuff about code pages and unicode and UTF is about as comprehensible as the unearthly groaning and hissing that I use to verbally communicate.   Let your humble dead forensicator try and put it all into some semblance of order - we need to know this for when I talk about "keyword searching" in a future post.   We will dispel some myths along the way, things that have been wrongly suggested to me as being true, over the years.

Let's skip computing's very early history and jump to the early IBM machines.
They used punch cards to store data; the most widely adopted punch card had 80 columns and 12 rows.   The card was punched at intersections of the rows and columns, and the locations of the punches represented numbers and letters.   However, there was no need to distinguish between upper and lower case letters as there were no word processing packages (Microsoft Office Victorian never really took off I guess).  Only a small set of other characters needed to be represented.   As a result, the character range that was represented was only 47: 0-9, A-Z and some other characters.  So at this point in time, the concept of binary storage was absent from computers.   All that changed in the mid-1950s with the advent of IBM's computer hard disk.  The first model, the IBM 350, had a 30 million BIT capacity.
The cost of storing a single BIT was eye-wateringly expensive.  There was a necessity to not only move to the concept of binary encoding, but also to ensure the smallest data unit (later named a byte) was as efficient as possible.   The old punch cards could represent 47 characters, all these characters could be represented with just 6 bits in a data unit, so IBM grouped together their bits into groups of 6 to form their data units.   Underlying these groupings was the concept of a truth table.
Each group of 6 bits was assigned a numeric value and a truth table was consulted to see what character the numeric value represented - this concept is still used today, it is often referred to as "character encoding".

IBM soon spotted the limitation of 6 bit groupings, especially when they came to thinking about word processing: simply adding lower case letters to the truth table would require 73 characters, and that is before you even consider the additional punctuation marks.  This led to 7 bit groupings, which were introduced on the IBM 355 hard disk.   This meant that 128 characters could be represented in the truth table for a 7 bit scheme - all upper/lower case characters, punctuation marks and the main math characters could easily be represented, with still some space in our truth table for more at a later date.

Looking back, we can laugh and ask why they didn't simply use an 8 bit byte, as 8 is a power of 2 and thus much more binary-friendly.   Well the reality was that the IBM 355 stored 6 million 7 bit groupings.  The cost of storing a single bit was in the 10s of dollars region.   Moving to an 8 bit grouping would mean that the 8th bit would seldom get used (there were already unclaimed slots in our 7 bit truth table), resulting in 1000s of dollars of redundant storage.   There was a conflict here between the coders and the engineers: using 7 bit bytes made for less efficient code, using 8 bit bytes made for inefficient storage.   Someone had to win, and it was the coders who emerged glorious and victorious once the fog of battle had cleared.   Thus the term byte was coined and the 8 bit grouping settled on.

Interestingly IBM actually experimented with 9 bit bytes at some point to allow error checking.   However, 8 bit groupings won on the basis of efficiency.   So, historically there have been many battles regarding the optimum way of grouping bits into data units.   This explains why IP addresses and the payload of internet data packets are encoded as "octets" - the earlier developers were explicitly stating that the bits should be put into 8 bit groupings, as opposed to using any of the other approaches to grouping bits then in existence.

8 bit bytes were settled on, but it was important that the truth table for these bytes was standardised across all systems.  Much like the VHS VS Betamax war (those of you, like myself, born at a more comfortable distance from the apocalypse will recognise that I am talking about video tape player standards here), there were two main competing systems for being the agreed standard for establishing the truth tables.  The first was EBCDIC, an extension of an older code called BCDIC, which was a means of encoding those 47 characters on the old punch cards with 6 bits.  EBCDIC was a full 8 bit character encoding truth table.  The main competing standard was ASCII (American Standard Code for Information Interchange).

MYTH No 1.  ASCII is an 8 bit encoding scheme.
Not true, it wasn't then and isn't now.  It is a 7 bit encoding scheme.

The fact that EBCDIC used the same number of bits as the newly defined 8 bit byte may explain why for most of the 1970s it was winning the war to be the standard.   Ultimately, ASCII offered more advantages than EBCDIC, in that the alphabet characters were sequential in the truth table and the unused 8th bit could be utilised as a parity bit for error checking, thus ASCII became the accepted standard for character interchange coding.

So now we had our 8 bit data transmission/storage unit, known as a byte and an agreed standard for our truth table - ASCII.   As ASCII used 7 bits,  128 characters could be defined.  The first 32 characters in the ASCII truth table are non printable e.g Carriage Return, Line Feed, Null, etc, the remaining character space defines the upper/lower case letters, the numbers 0-9, punctuation marks and basic mathematical operators.

All would have been well if only the English speaking world wanted to use computers...obviously the ASCII table allowed for English letter characters.  Unsurprisingly, citizens of other nations wanted to use computers and use them in their native language.  This posed a significant problem, how to transmit and store data in a multitude of languages?  English and a lot of Western European languages use the same characters for letters, they are Latin Script letters.   Even then, there were problems with cedillas, accents and umlauts being appended to letters.  Things got more complicated moving into central Europe and Eastern Europe, where Greek Script and Cyrillic script are used to represent the written form of languages there.
The solution was to use that 8th bit in ASCII, which gave 128 additional slots in the truth table; developers could define characters for their language script using those slots.   The resultant truth tables were known as "code pages".

Myth 2: Code pages contain only non-English language characters.  
Not true, most code pages are an extension of the ASCII standard, thus the code pages contain the same characters as the 7 bit ASCII encoding + up to 128 additional language characters.

The implication here is that if I prepare a text file with my GB keyboard, using English language characters, and save the file with any of the code pages designed for use with, say, cyrillic script, then the resultant data will be the same as if I saved the file using plain old ASCII.  My text only uses characters in the original ASCII defined character set.   So if you are looking for the word "zombie" on a data set, but don't know if any code pages have been used, there is no need to worry: if the word was encoded using any of the extended ASCII code pages, you don't have to experiment with any code page search parameters.
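
You can check this for yourself with iconv.   The sketch below converts the word "zombie" into a few encodings and dumps the resulting bytes as hex; the helper name hexbytes and the particular choice of code pages are mine, and I am assuming your iconv build knows those names.

```shell
#!/bin/sh
# Print the bytes arriving on stdin as one compact hex string.
hexbytes () {
    od -An -tx1 | tr -d ' \n'
}

word='zombie'
for enc in CP1251 ISO-8859-5 ISO-8859-7 UTF-8 ; do
    printf '%s: ' "$enc"
    printf '%s' "$word" | iconv -f ASCII -t "$enc" | hexbytes
    echo
done
# Every line shows the same six bytes: 7a6f6d626965
```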

Myth 3:  The bad guys can hide ASCII messages by saving them in extended ASCII code pages.  Not true..see above!

So, the use of code pages solved the problem of encoding some foreign language characters.  I say "some" because there are a number of languages that have far more characters in their alphabet than can be stored in the 128 slots at the upper end of the ASCII table.  This fact, coupled with the near impossibility of translating some code pages into others, led to the development of unicode.  Unicode is essentially a super set of all the code pages.  It is a truth table that seeks to assign a numeric value for every "character" in every language in the world; in addition the truth table also includes symbols used in other fields such as science and mathematics.   Each slot in the unicode truth table is referred to as being a "code point".  The concept of "characters" also needs extending. Many of the "character" renderings in Arabic script are dependent on the rendering of characters either side of them.   In other languages, several letters may be combined to form a single character.  These combined letters are referred to as being "glyphs", and the unicode standard emphasises the concept of glyphs.   It follows that if your software has looked up the numeric value of a "character" in the unicode truth table, then that software (or your O/S) must have the corresponding glyph installed to display the "character" correctly to the user.

Well in excess of 1,000,000 code points exist in unicode.   Fundamentally, then, computers perform word processing (and other operations involving character strings) with numbers.  Those numbers are code points; the code point is looked up in the unicode truth table and the corresponding glyph displayed on the screen.  Those numbers also need to be represented in binary.  No problem...we could represent all of the numbers (or code points) in our unicode truth table with 3 bytes. However, this approach is RAM inefficient.   Unicode divides its full repertoire of characters into "planes" of 65,536.  The first plane contains all the characters for modern languages, therefore we only need 2 bytes to represent those, and we are wasting one byte per glyph if we use a 3 byte value. The problem is even worse when dealing with the English characters: their code points are at the start of the unicode truth table (making this part of unicode backwardly compatible with ASCII), so they only need 1 byte to represent a character - 2 bytes are therefore being wasted using a 3 byte scheme.   What was needed were schemes to encode those numbers (code points!); the most popular currently in use are UTF-8, UTF-16 and UTF-32.

Myth 4: UTF-8 uses 1 byte, UTF-16 uses 2 bytes, UTF-32 uses 4bytes.  
Not entirely true.  UTF-32 does use a fixed 32 bit encoding, but UTF-16 uses either one or two 16 bit units per character, and UTF-8 is a variable length scheme that can use 8 - 32 bits.

There are pros and cons for each encoding method. UTF-32 means that all characters can be encoded with a fixed length value occupying 4 bytes.   However, this is inefficient, for the previously stated reasons.   UTF-8 is a variable length encoding.  If the encoded code point value is for an ASCII text character then the first bit is set to zero, and the remaining 7 bits (we only need one byte for ASCII !!) are used to store the value of the code point.  It is fully backward compatible with the original ASCII standard. Thus if you are searching for an English language string in a UTF-8 data set then you don't need to set any special parameters with your search configuration.
If the UTF-8 encoded character requires more than 1 byte then the first bit(s) of the first byte are set to reflect the number of bytes used in the array (which can be 2, 3 or 4 bytes).  Thus in a two byte array, the first 2 bits are set to 1, in a 3 byte array the first 3 bits are set to 1, and so on.
UTF-16 encoding uses 16 bit units; a single unit covers the first plane, but for characters above the first plane of Unicode, two units (a "surrogate pair") are needed.
Conceptually it is quite straightforward, however, with UTF-16 and UTF-32 there is an issue of endian-ness: should the encoded value be in big endian or little endian?
One approach is to use a Byte Order Mark, which is simply a short marker at the start of the encoded data (two bytes in UTF-16, four in UTF-32) that indicates the endian-ness being used.
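
To see those schemes side by side, here is a small sketch that pushes the same code point through iconv and dumps the bytes.   The helper names hexbytes and encode_as are invented for this example, the characters are written as octal escapes of their UTF-8 bytes so the script itself stays plain ASCII, and the encoding names are assumed to be known to your iconv.

```shell
#!/bin/sh
# Print the bytes arriving on stdin as one compact hex string.
hexbytes () { od -An -tx1 | tr -d ' \n' ; }

# encode_as ENCODING: read UTF-8 on stdin, print the re-encoded bytes as hex.
encode_as () {
    iconv -f UTF-8 -t "$1" | hexbytes
}

# U+00E9 (e with an acute accent), given as its two UTF-8 bytes
printf '\303\251' | encode_as UTF-8    ; echo   # c3a9
printf '\303\251' | encode_as UTF-16BE ; echo   # 00e9
printf '\303\251' | encode_as UTF-16LE ; echo   # e900
printf '\303\251' | encode_as UTF-32BE ; echo   # 000000e9

# U+1F480 (a skull, naturally) sits above the first plane:
# four bytes in UTF-8, a surrogate pair in UTF-16
printf '\360\237\222\200' | encode_as UTF-16BE ; echo   # d83ddc80
```

Note how the first UTF-8 byte of the two byte character (c3) starts with the bits 110, and the first byte of the four byte character (f0) starts with 11110, as described above.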

From a programmer's point of view, you want your application to know what encoding scheme is being used to encode the characters.  Your average user doesn't really care what encoding scheme is being used as long as the string is being rendered correctly.   The problem is for us forensicators who, given a large dataset (especially that stuff in unallocated space), really need to know what encoding schemes are being used as this will greatly affect our keyword searching strategy. There are no fixed signatures in a stream of characters that are encoded with the old school code pages.   Certainly you can experiment by viewing files in different code pages to see if the resultant data looks like a valid and coherent script - some of the forensic tools can be configured to do this as part of their analysis routines.   The forensicators in the English speaking world (and those using a written language that is based on Latin Script) have got it fairly easy.  But what about if your suspect is a Foreign National?   Even if they have the English language version OS installed on their machine, an English character keyboard and the English language word processing package, this doesn't exclude the possibility of them having documents, emails, or web pages in a foreign language on their system.

Myth 5: You can find the word "hello" in a foreign language data set by typing the string "hello" into your keyword searching tool and selecting the appropriate code page for the relevant foreign language.   
This is so wrong, and I have heard this view being expressed several times.  

You need to understand that when doing keyword searching, programmatically, you are actually doing number searching, those numbers being the code points in a particular truth table.  It follows that you need to know the type of truth table used to encode the characters, but before you do that you need to translate the english word you are looking for (hello) into the specific language you are looking for (and there may be several different possible translations).   In a non-latin script based language there will be no congruence between the old school code pages and the new school unicode encoding schemes.  So you will need to search for the code point numbers in UTF-8, UTF-16, UTF-32 and any relevant code pages.   There are some tools, particularly e-discovery tools, that advertise that they can detect foreign language based data, but this is different from detecting the encoding method used.
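
To make the "number searching" point concrete, this sketch takes one translated keyword and prints the byte pattern you would actually have to hunt for in each encoding.   The keyword is the Russian word "privet" (hello), written as octal escapes of its UTF-8 bytes; the function names and the list of encodings are my own choices for illustration, and your iconv is assumed to support them.

```shell
#!/bin/sh
# Print the bytes arriving on stdin as one compact hex string.
hexbytes () { od -An -tx1 | tr -d ' \n' ; }

# search_patterns WORD ENCODING...: WORD is UTF-8; print one line per
# encoding with the hex bytes that represent WORD in that encoding.
search_patterns () {
    word=$1 ; shift
    for enc in "$@" ; do
        printf '%s\t' "$enc"
        printf '%s' "$word" | iconv -f UTF-8 -t "$enc" | hexbytes
        echo
    done
}

# "privet" in Cyrillic script, as UTF-8 bytes
privet=$(printf '\320\277\321\200\320\270\320\262\320\265\321\202')
search_patterns "$privet" UTF-8 UTF-16LE CP1251
# UTF-8     d0bfd180d0b8d0b2d0b5d182
# UTF-16LE  3f0440043804320435044204
# CP1251    eff0e8e2e5f2
```

Three encodings, three completely different byte strings for the same six letters, which is why picking a code page and typing "hello" gets you nowhere.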
All in all, foreign language detection in forensics is challenging to say the least.  So many issues about digital forensics are discussed, yet I see very little around foreign language handling and character encoding.   If you have any experience in this field then please feel free to comment.

One of the programs on my previewing system will attempt to identify the language used in non-English Office documents, PDFs, emails and web-pages; it will also output the data in plain text so that it can be copied and pasted into google translate to get at least an idea of the theme of the data.   I will post that code in a future posting; for now any comments about this whole topic will be gratefully received.

Monday, 3 September 2012

Identifying the dead redux

In my previous POST I showed you how to do a file signature check on every file in the file system.   That should result in us having a flat, text based database that has this format:

/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/ Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators

The format of the file is two fields separated by a colon.   The first field is the full path and filename, the second field (after the colon) is a description of the file taken from the magic file.  Notice that the file command can identify ASCII text files.
Remember that our database is a file called listoffiles, it is in the path $reppath/tmp/.
If we wanted to create a list of graphic files then we could use a command like this:
grep 'image data' $reppath/tmp/listoffiles | cut -d: -f 1 > $reppath/tmp/liveimagefiles.txt

The above command searches each line of our file list for the string "image data", then pipes the results to the cut command, which prints the first field of any matched lines, so the command returns the following:

/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg

The inestimable Barry Grundy points out that this is only useful if you ensure that the description of all  the graphic files in your magic file contains the string "image data".  You should NOT edit your system magic file, simply create your custom magic file, the format is simple, and place it in your /etc/ directory.

In the example above, everything works fine, but now consider this data set.

/media/sda1/stuff/TK8TP-9JN6P-7X7WW-RFFTV-B7QPF: ASCII text
/media/sda1/stuff/mov00281.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/ Zip archive data, at least v2.0 to extract
/media/sda1/stuff/th_6857-1c356e.jpg: JPEG image data, JFIF standard 1.01
/media/sda1/stuff/.Index.dat: data
/media/sda1/stuff/mov00279.3gp: ISO Media, video data, MPEG v4 system, 3GPP, video data
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
/media/sda1/stuff/image data.doc: Microsoft Office Document

Using our previous command we now get these results:

/media/sda1/stuff/cooltext751106231.png
/media/sda1/stuff/th_6857-1c356e.jpg
/media/sda1/stuff/image data.doc

The final line is a false positive; we can avoid this happening by only searching the second field (the file description) of each line of the file list and printing the first field of any matches.   We can use the awk command to do this.  In its simplest form awk will break down a line of input into fields.   By default the field separator is white space, however you can change the field separator by using the -F option, then select which field to print with the print statement.   So, we could print all the file paths and names (the first field in our listoffiles) with this command:
awk -F: '{print $1}' $reppath/tmp/listoffiles

If we wanted to print just the descriptions (the second field) then we could use this:
awk -F: '{print $2}' $reppath/tmp/listoffiles
We can see that awk uses the nomenclature $1, $2, $3 etc to name each field in the lines of input.

If we wanted to search the second field for our "image data" pattern then we could use this command:
awk -F: '$2 ~ /image\ data/ {print $1}' $reppath/tmp/listoffiles

We use the -F option to set the field separator to a colon, and we put our statement in single quotes. The $2 ~ means go to the second field and match the following pattern (which has to be inside the / / characters), then print the first field of any matching lines.   The white space in our pattern is escaped here with a \ character, although awk will also accept a plain space inside the / / characters.   We can redirect the output to a file, meaning that we have a complete list of all the graphic file paths and names in a single file.   We can read that list of graphic files to copy out our graphic images...but there are some problems we need to overcome....
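
Here is a self-contained sketch you can run to see the difference between the two approaches.   It builds a three line throwaway database in a temporary file (cut down from the sample data above) and runs both searches against it; a plain space is used in the awk pattern here, which awk also accepts.

```shell
#!/bin/sh
# Build a small throwaway database in the same "path: description" format.
db=$(mktemp)
cat > "$db" <<'EOF'
/media/sda1/stuff/cooltext751106231.png: PNG image data, 857 x 70, 8-bit/color RGBA, non-interlaced
/media/sda1/stuff/phoneid.txt: ASCII text, with CRLF line terminators
/media/sda1/stuff/image data.doc: Microsoft Office Document
EOF

# Whole-line match: also hits "image data" inside the file name
by_grep=$(grep 'image data' "$db" | cut -d: -f1)

# Description-only match: the file name field is never examined
by_awk=$(awk -F: '$2 ~ /image data/ {print $1}' "$db")

printf 'grep:\n%s\n\nawk:\n%s\n' "$by_grep" "$by_awk"
rm -f "$db"
```

The grep version reports both the .png and the .doc file, while the awk version reports only the .png.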

Identifying the dead

When previewing a system, once you get down to the file system level, the first thing you might want to do is create a database of live files and what those files actually are - what is known as a file signature analysis.   Linux generally relies on file signatures to identify file types, rather than file extension (which is the Windows way).
Therefore there is already an expansive file signature database installed on all Linux systems named "magic" (often found at /usr/share/misc).   In addition you can create your own magic file with your customised signatures.    The magic file is used extensively by the "file" command.  The file command compares the signature in a regular file to the signatures in the magic database(s) and returns a description of the file based on the content of the magic file.   Most forensicators understand the significance of doing a file signature check as opposed to relying on file extension.
A user can change a file extension, the firefox web browser caches files to its web cache with no file extension, and the opera browser caches files with a .tmp file extension.   So if you are viewing graphic files via your forensic suite, or even if you have exported them to view in a file browser, have you checked whether your tool of choice is working on the file extension or file signature?   It may be that you are missing 1000's of images in, say, the firefox cache based on faulty assumptions...make sure you check, chaps!   Sorting files by file extension is about as effective as aiming for the centre of mass on a zombie in the hope of stopping them.   Remember:  Head shots to destroy zombies, file signature analysis to identify file types.  Simples!
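
To show what a signature check boils down to, here is a toy version that looks only at the leading bytes of a file and ignores its name completely.   It is NOT how the file command is implemented and it only knows two signatures (JPEG and PNG); the function name sniff_type is made up for this sketch.

```shell
#!/bin/sh
# Toy signature check: read the first four bytes as hex and compare them
# against two well known signatures. Anything else is reported as "data".
sniff_type () {
    sig=$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')
    case $sig in
        ffd8ff*)  echo "JPEG image data" ;;
        89504e47) echo "PNG image data" ;;
        *)        echo "data" ;;
    esac
}

# Usage sketch: a cached picture with a misleading extension
# printf '\377\330\377\340' > /tmp/cache.tmp
# sniff_type /tmp/cache.tmp
```

A file called cache.tmp that starts with ff d8 ff is still reported as a JPEG, which is exactly the behaviour you want from your tool of choice.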

Anyway, we can start to create our previewing script.  Remember with my file system mounting SCRIPT all the file systems were mounted under the /media node, thus /dev/sda1 gets mounted at /media/sda1, /dev/sda2 at /media/sda2 etc etc.   The script also initiates a loop to process each mounted file system in turn and exports some variables that can be used by our processing script.  Also, our external drive is mounted at /mnt/cases.  Here is the relevant part of the script:

for i in `cat /etc/mtab | grep media | egrep -iv 'cd|dvd' | awk '{print $2}'`
export volpath=`echo $i` #eg /media/sda1
export suspart=`echo $volpath | sed 's/media/dev/g'` #eg /dev/sda1
export fsuspart=`echo $suspart | sed 's/\///g'` #eg devsda1
export susdev=`echo $suspart | sed 's/[0-9]*//g'` #eg /dev/sda
export tsusdev=$susdev #eg /dev/sda
dirname=`echo $suspart | sed 's/\//_/g'` #eg _dev_sda1
ddirname=`echo $susdev | sed 's/\//_/g'` #eg _dev_sda
sudo mkdir -m 777 $evipath/$csname/$evnum/$ddirname #eg /mnt/cases/BADGUY_55-08/ABC1/_dev_sda
sudo mkdir -m 777 $evipath/$csname/$evnum/$ddirname/$dirname
cd $evipath/$csname/$evnum/$ddirname/$dirname
export reppath=`pwd` #eg /mnt/cases/BADGUY_55-08/ABC1/_dev_sda/_dev_sda1
sudo mkdir -m 777 findings
sudo mkdir -m 777 Report
sudo mkdir -m 777 tmp
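
If you want to check what the sed expressions in the excerpt above produce without mounting anything, a small sketch like this runs them against a sample mount point. The function name derive_names is made up for illustration; the sed expressions are copied from the excerpt.

```shell
# derive_names MOUNTPOINT: apply the same sed transforms as the loop above
derive_names() {
    volpath=$1                                          # eg /media/sda1
    suspart=$(echo "$volpath" | sed 's/media/dev/g')    # eg /dev/sda1
    susdev=$(echo "$suspart" | sed 's/[0-9]*//g')       # eg /dev/sda
    dirname=$(echo "$suspart" | sed 's/\//_/g')         # eg _dev_sda1
    ddirname=$(echo "$susdev" | sed 's/\//_/g')         # eg _dev_sda
    echo "$suspart $susdev $ddirname/$dirname"
}

derive_names /media/sda1   # prints: /dev/sda1 /dev/sda _dev_sda/_dev_sda1
```

Note that stripping every digit only works for names of the sda1 style; a device such as /dev/nvme0n1p1 would lose the digits in the middle of its name too, so that expression would need adjusting for those disks.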


First thing to note is that Forensicator has been a bad zombie by putting his variables in lower case; it is much better programming practice to put them in upper case to make the code easier to read.
The first line of the excerpt initiates our loop: it isolates all the file systems mounted under /media by looking in the /etc/mtab file, then excludes our cd/dvd drive in case we have booted the system from CD (as opposed to thumb drive).   From line 9 onwards it references some variables called $evipath, $csname and $evnum.   If you look earlier in the diskmount script you will see that they were created during an interactive session, when the user was prompted for input, like this:

export evipath=/mnt/cases
echo -n "What is the case name (NO SPACES OR FORWARD SLASHES)? > "
read csname #eg BADGUY_55-08
export csname
echo -n "What is the evidence number of suspect system (NO SPACES OR FORWARD SLASHES)? > "
read evnum #eg ABC1
export evnum
echo -n "What is your rank and name? > "
read examiner #eg DC_Sherlock_Holmes
export examiner
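
The prompts above ask for no spaces or forward slashes but do not enforce it, and a stray slash would end up in the directory names built from these values. A minimal sketch of a check follows; the function name valid_label is made up for illustration and is not part of the diskmount script.

```shell
# valid_label VALUE: succeed only if VALUE is non-empty and contains
# no whitespace and no forward slash
valid_label() {
    case $1 in
        ''|*[[:space:]/]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Usage, re-prompting until the answer is acceptable:
#   until valid_label "$csname"; do
#       echo -n "What is the case name (NO SPACES OR FORWARD SLASHES)? > "
#       read csname
#   done
#   export csname
```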

The "read" command is great for getting user input assigned to a variable; the value is then exported for use by other scripts.   I have commented the code (anything after the # character) to show you an example of what the variable value looks like.   So, we have a mounted external drive and we have created a case directory structure on it.  The topmost directory is the case name, the next directory down the tree is that of the evidence number, inside that will be a directory for each physical device eg. _dev_sda, inside that there will be a directory for each partition (_dev_sda1, _dev_sda2, etc etc), and inside each of those will be 3 directories named "findings", "Report" and "tmp".  We have also, as part of our loop, created a variable called $reppath (short for Report Path).  This variable points at the partition's directory on our output drive, so that data can be sent to the findings|Report|tmp directory.  An example of a $reppath variable value would be something like:

If I wanted to create a database of all the files and their description for each partition the code would therefore be:

find $volpath -type f -exec file {} \;  >> $reppath/tmp/listoffiles

The $volpath is the mounted partition, eg /media/sda1.   The database is called listoffiles (it is just a simple text file), when the $reppath variable gets expanded the full file name and path would be something like:


The syntax for the find command is a bit weird if you aren't familiar with all the options.  The command is saying: find all entities in the path (for instance) /media/sda1, confine the results to regular files ( -type f), execute the file command (-exec file) for each entity found ( {} ) and append the results to my listoffiles (the >> redirection adds to the file rather than overwriting it).
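
As a variation on the command above, here is a sketch that quotes its paths (so mount points or report paths containing spaces survive) and ends the -exec clause with + instead of \; so that find hands many file names to each run of file rather than starting one process per file. The function name build_signature_db is made up for illustration.

```shell
# build_signature_db VOLPATH DATABASE
# Append a "path: description" line to DATABASE for every regular file under VOLPATH.
build_signature_db() {
    find "$1" -type f -exec file {} + >> "$2"
}

# Usage with the variables from the loop:
#   build_signature_db "$volpath" "$reppath/tmp/listoffiles"
```

One side effect worth knowing: when file is given several names at once it pads the descriptions so they line up, so expect extra spaces after the colon in the database.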

This is how I would do the file signature check.  Once this database is created, I script out interrogating the database for certain file types and then processing those.  The code I have, and will be publishing here, processes the live set, deleted set and unallocated space along with the inter-partition gaps and ambient data such as swap/hiberfil/memory dumps.  It exports various files for review, hunts for encrypted files, processes compressed data and various archive formats, creates storyboards of any movie files, does virus checking, recovers and processes 25 different chat/messaging formats, processes all the major email formats, processes p2p history files, does complete URL recovery and analysis, and lots more.   This is all done automatically with a single command.
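
To give a flavour of what interrogating the database can look like, here is a minimal sketch that lists the path of every entry whose description matches a pattern. It assumes each line has the "path: description" layout that file produces; the function name files_matching is made up for illustration, and this is not the processing code described above.

```shell
# files_matching PATTERN DATABASE
# Print the path of each entry whose description matches PATTERN (case-insensitive).
files_matching() {
    awk -v pat="$1" '{
        sep = index($0, ": ")              # first ": " splits path from description
        if (sep == 0) next                 # skip lines that are not "path: description"
        path = substr($0, 1, sep - 1)
        desc = substr($0, sep + 2)
        if (tolower(desc) ~ tolower(pat)) print path
    }' "$2"
}

# Usage:
#   files_matching 'image data' "$reppath/tmp/listoffiles" > "$reppath/tmp/pictures.txt"
```

Because the match is made on the description and not the name, an extensionless file in a browser cache is picked up just the same as one called holiday.jpg. A path that itself contains a colon followed by a space would be split in the wrong place, so treat the output as a starting point.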

Sunday, 2 September 2012

A plague of previewing...the rising

So you have created your preview disk, booted the system and now want to start your previewing.  You can use some forensic tools such as the SLEUTHKIT to do some analysis, but we might want to use a lot of the tools built into linux...for that we need to mount the file systems in a forensically sound way, and probably mount some external storage to send our exported data to.

There are a number of fantastic tools available to us in Linux for discovering physical disks, partition tables and file system information, before we mount those file systems.
We will also want to know what exists in the gaps between partitions: is there raw data, or maybe a complete, but deleted, file system?  If a file system fails to mount, we would want to know whether the file system is encrypted.  If the file system is BitLocker encrypted, maybe we want to scan the partition for a recovery key.  We will also want to mount an external drive in read/write mode to collect any results from our analysis. This can be quite a challenging task at the terminal, so I have written a script that will automatically perform most of the above tasks in seconds.   You can add this script to your boot CD.  Note there are a couple of dependencies that should be available to install from your system's package manager; the deps are:
gdisk (to handle GPT partitions)
hachoir tools set
hdparm (to get detailed info about IDE/SATA hard disks)
sdparm (to get detailed info about SATA/SCSI hard disks)
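
Before relying on those tools it is worth confirming they are actually on the boot media. A minimal sketch, with the made-up function name missing_tools, prints any command that cannot be found. The hachoir command names depend on how the tool set is packaged, so they are not listed in the usage line.

```shell
# missing_tools NAME...: print each named command that is not on the PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage:
#   missing_tools gdisk hdparm sdparm
```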

So this will be the flow of our script:
Check if the user is root.
Get a list of partitions
Get list of mounted partitions - we need to do this in case we have booted using a thumbdrive, we want to exclude our boot media from the analysis.
Get list of physical disks (in case we want to do anything at the physical disk level)
Check for presence of GPT partition tables.
Check each partition for presence of bitlocker signatures.
Mount all mountable partitions under the /media node in Read Only mode.
Check for the presence of any Apple Mac partitions, and mount any partitions found.
For any file system that fails to mount, conduct an entropy test on a sample of 5mb of data from the partition - warn user if the file system appears to be encrypted (I will cover this in a bit more detail in a future post).
Create mount point for external drive
Prompt user to plug in external drive
Detect and mount partition on external drive in Read/Write mode
Create case directory structure on external drive
Create report containing details of interrogated system, hard disk info, partition info, RAM/Processor info
Process each mounted partition in turn (we can do a simple check to determine if Windows or MacOS is present, then launch another script appropriate for that OS.  If no OS is detected, assume it is a storage disk and process accordingly).
Once each mounted partition is processed, image each inter-partition gap in turn to the external drive.
Analyse each inter-partition gap to see if it has a valid file system; if so, mount the file system and process it, else treat it as raw data and process with a different script.
Check for presence of Linux swap partitions and process same
If any bit-locker signatures found, launch process to look for recovery key on drive.
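
The entropy test in the flow above is not spelled out in this post, so the following is only a sketch of one way it could be done, not the code from the script. It computes the Shannon entropy of the byte values in the first 5 MiB of a partition; results close to 8 bits per byte suggest encrypted (or compressed) data. The function name byte_entropy and the exact sample size in bytes are assumptions.

```shell
# byte_entropy PATH [BYTES]
# Print the Shannon entropy, in bits per byte, of the first BYTES bytes of PATH
# (default 5242880, i.e. 5 MiB). PATH can be a file or a device such as /dev/sda2.
byte_entropy() {
    head -c "${2:-5242880}" "$1" | od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
         END {
             if (total == 0) { print "0.000"; exit }
             for (b in count) {
                 p = count[b] / total
                 h -= p * log(p) / log(2)    # log base 2 via natural log
             }
             printf "%.3f\n", h
         }'
}

# Usage:
#   byte_entropy /dev/sda2
```

A sample made entirely of one byte value scores 0, and a sample in which all 256 byte values are equally common scores 8. Where to draw the warning line between those is a judgment call that this sketch leaves open.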

In the script you will see that I have commented out the lines to launch the various analysis scripts for each partition.   We'll look at some of the interesting types of automated analysis that we can do in future posts.  For now, you can find the shell script HERE.

You may wonder why I use the "loop" option to the mount command, this is to prevent the journal in some journalling file systems being MODIFIED.

The inter-partition gap analysis is something that isn't always done in computer forensics; the layout of many computer forensic suites doesn't lend itself to easy analysis of these areas, so it is often overlooked.  If you aren't looking in the inter-partition gaps routinely then you are doing your analysis incorrectly.
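
As a rough illustration of finding those gaps, the sketch below takes the start and end sector of each partition and reports any unused run of sectors between them. It assumes you have already pulled the boundaries out of the partition table into "start end" pairs, one per line; the function name find_gaps and the sample figures in the usage are made up, and this is not the code from the script.

```shell
# find_gaps: read "START END" sector pairs (inclusive) on stdin, one partition
# per line, and print each run of sectors that lies between two partitions.
# Space before the first partition or after the last one is not reported.
find_gaps() {
    sort -n | awk '
        NR > 1 && $1 > prev_end + 1 {
            printf "gap: start=%.0f sectors=%.0f\n", prev_end + 1, $1 - prev_end - 1
        }
        { if (NR == 1 || $2 > prev_end) prev_end = $2 }'
}

# Usage with made-up partition boundaries:
#   printf '2048 206847\n206848 1000000\n1002048 2000000\n' | find_gaps
#
# A reported gap can then be imaged with dd, assuming 512-byte sectors:
#   dd if=/dev/sda of="$reppath/tmp/gap.img" bs=512 skip=START count=SECTORS
```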