Wednesday 2 January 2013

Traces of the dead

I have blogged about bulk_extractor on several occasions.   As it is such an essential and useful tool for forensicators, I thought I would do a post dedicated to the tool.

Bulk_extractor is a tool that scans a disk, disk image or directory of files for potentially evidential data. It has a number of scanners that search for various artifacts such as urls, email addresses, credit card numbers, telephone numbers and json strings. The recovery of json strings is particularly useful as a number of chat clients generate their chat messages in json format.

The url recovery is another extremely useful feature. We probably all have our favourite web history parsers and web history recovery tools. However, recovering all the available web browser history on a disk is incredibly difficult. Once again, you really need to know what your tool of choice is doing. So, if you have a tool that claims to recover Internet Explorer history, do you understand how it works? Maybe it is looking for the index.dat file signature? This is useful...up to a point. It will work fine for index.dat files stored in contiguous clusters, but what happens if the index.dat file is fragmented? Under these circumstances your tool may only recover the first chunk of index.dat data, as this chunk contains the index.dat header. Your tool may therefore miss thousands of entries that reside in the other fragmented chunks. Most respectable tools for recovering IE index.dat files look for consistent and unique features associated with each record in the file, thus ensuring that as many entries as possible are recovered - there is a sketch of this record-level approach below. Other web browser history may be stored in an sqlite database. Finding the main database header is simple enough, and even analysing the sqlite file to establish whether it is a web history file is simple. However, it gets much more difficult if there is only a chunk of the database available in unallocated space. Some tools are able to recover web history from such fragments in some circumstances - does your tool of choice do this? Have you tested your assumptions?
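
Going back to the index.dat example, here is a minimal Python sketch of that record-level approach (my own toy illustration, not bulk_extractor's internals). It hunts for the "URL ", "REDR" and "LEAK" tags that, as I understand it, begin each index.dat activity record, so even a fragmented file gives up the records sitting in each recovered chunk:

import re
import struct
import sys

# index.dat activity records start with a 4-byte type tag followed by a
# 32-bit record length expressed in 128-byte blocks (to the best of my
# knowledge - verify against your own test data)
SIG = re.compile(rb'URL |REDR|LEAK')

def carve(image_path):
    data = open(image_path, 'rb').read()   # fine for test images; mmap bigger ones
    for match in SIG.finditer(data):
        offset = match.start()
        if offset + 8 > len(data):
            continue
        blocks = struct.unpack_from('<I', data, offset + 4)[0]
        if 0 < blocks < 64:                # crude sanity check against false hits
            print(offset, match.group().decode(), blocks * 128)

carve(sys.argv[1])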
Some web history is stored in such a way that there are no consistent and unique record features in the web history file.  Opera web history has a simple structure that doesn't stretch much beyond storing the web page title, the url and a single unix timestamp.   There are no "landmarks" in the file that a programmer can use to recover the individual records if the web history gets fragmented on the disk.   Yahoo browser web history files pose much the same problem.  
bulk_extractor overcomes these problems by simply searching for urls on the disk. It ingests 16MB chunks of data from your input stream (disk or disk image), extracts all the ascii strings and analyses them to see if any have the characteristics of a url, i.e. they start with "http[s]" or "ftp" and have the structure of a domain name. In this way you can be confident that you have recovered as much web history as possible. However, there is a big downside here - you will also recover LOTS of urls that aren't part of a web history file. You will recover urls that are in text files and pdf files, but most likely urls that are hyperlinks from raw web pages. Fortunately, the output from bulk_extractor can help you here. Bulk_extractor will create a simple text file of the urls that it finds. Each entry lists the byte offset of the url, the url itself and a context entry; the context entry does what it says on the tin - it shows the url with a number of bytes either side of it, giving you the context that the url was found in. I have split a line of output from bulk_extractor across three lines for ease of viewing. The first line shows the byte offset on the disk where the url was found, the second line shows the url, and the third line shows the url in context as it appears on the disk:
81510931
http://www.gnu.org/copyleft/fdl.html
opyright" href="http://www.gnu.org/copyleft/fdl.html" />\x0A    <title>

As can be seen, the url, when viewed in context, is preceded by "href=", which indicates that the url is actually a hyperlink from a raw web page.
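
That context column makes it easy to thin out the hyperlink noise with a few lines of script. Here is a minimal sketch, assuming the usual tab-separated layout of url.txt (offset, url, context), that simply drops anything whose context contains "href=":

import sys

# bulk_extractor feature files are tab-separated: offset, feature, context
# (comment lines start with a #)
for line in open(sys.argv[1], errors='replace'):
    if line.startswith('#') or not line.strip():
        continue
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 3:
        continue
    offset, url, context = fields[0], fields[1], fields[2]
    if 'href=' in context:
        continue            # probably just a hyperlink in a raw web page
    print(offset, url)

Crude, but it shrinks the haystack considerably before you start reviewing urls by hand.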
Bulk_extractor doesn't stop there though. It will also analyse the recovered urls and generate some more useful data. One of the files that it generates is url_searches.txt - this contains any urls associated with searches. The file also shows the search terms used and the number of times that the search urls appear on the disk, so a couple of lines of output might look like this:
n=5 firefox
n=4 google+video
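
If you ever need to pull the typed terms out of a recovered search url yourself, a few lines of Python will do it. The parameter names below are my own assumptions about the common engines, so treat them as a starting point rather than an exhaustive list:

from urllib.parse import urlparse, parse_qs

# parameter names commonly used to carry the typed search terms
TERM_KEYS = ('q', 'p', 'query')       # google/bing, yahoo, various others

def search_terms(url):
    qs = parse_qs(urlparse(url).query)
    for key in TERM_KEYS:
        if key in qs:
            return qs[key][0]
    return None

print(search_terms('http://www.google.com/search?q=google+video'))   # -> google video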

You may need to parse individual records with your favourite web history parser, or do it manually, if your favourite web history recovery tool fails to recover a url that bulk_extractor found - this has happened to me on several occasions!

bulk_extractor also creates histograms - files that show the urls along with the number of times they appear on the disk, sorted in order of popularity. Some sample output looks like this:


n=1715 http://www.microsoft.com/contentredirect.asp. (utf16=1715)
n=1067 http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul
n=735 http://www.mozilla.org/MPL/
n=300 http://www.mozilla.org/xbl (utf16=2)
n=292 http://go.microsoft.com/fwlink/?LinkId=23127 (utf16=292)
n=228 http://home.netscape.com/NC-rdf#Name (utf16=49)
n=223 http://ocsp.verisign.com
n=220 http://www.DocURL.com/bar.htm
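
The histogram format is simple enough to post-process. A minimal sketch, assuming the layout shown above, that splits each line into the count, the url and the optional utf16 count:

import re
import sys

# histogram lines look like:  n=1715 http://example.com (utf16=1715)
LINE = re.compile(r'n=(\d+)\s+(\S+)(?:\s+\(utf16=(\d+)\))?')

for raw in open(sys.argv[1], errors='replace'):
    match = LINE.match(raw)
    if match:
        count, url, utf16 = match.groups()
        print(count, url, utf16 or '0')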

Notice how urls that are multi-byte encoded (the utf16 counts above) are also recovered by default. Obviously there are going to be a LOT of urls that will appear on most hard drives - urls associated with Microsoft, Mozilla, Verisign etc.
You can download some white lists that will suppress those types of urls (and emails and other features recovered by bulk_extractor). Other url analysis that bulk_extractor performs includes discovering Facebook ID numbers and SkyDrive ID numbers. Of course, it is trivially simple to write your own scripts to analyse the urls and discover other interesting ones. I have written some to identify online storage, secure transaction and online banking urls, along the lines of the sketch below.
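
Something like this run over url.txt will do the job - the keyword buckets are purely illustrative, so substitute domains relevant to your own cases:

import sys

# crude keyword buckets - purely illustrative, extend for your own cases
CATEGORIES = {
    'online storage': ('dropbox.com', 'skydrive.live.com', 'mediafire.com'),
    'online banking': ('hsbc.co.uk', 'barclays.co.uk', 'natwest.com'),
    'secure pages':   ('https://',),
}

for line in open(sys.argv[1], errors='replace'):
    if line.startswith('#') or '\t' not in line:
        continue
    url = line.split('\t')[1].lower()     # second column is the url itself
    for label, needles in CATEGORIES.items():
        if any(needle in url for needle in needles):
            print(label, url)
            break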

bulk_extractor will also recover email addresses and create histograms for them. The important thing to remember here is that there isn't (currently) any pst archive decompression built into bulk_extractor. Therefore, if your suspect is using the Outlook email client, you will have to process any pst archive structures manually. Other than that, the email address recovery works in exactly the same way as url recovery.

Using bulk_extractor can be a bit daunting, depending on what you are looking for and searching through. But to recover urls and emails from a disk image called bad_guy.dd the command is:
bulk_extractor -E email -o bulk bad_guy.dd

By default ALL scanners are turned on (the more scanners enabled, the longer it will take to run). Using the -E switch disables all the scanners EXCEPT the named scanner - in our case the "email" scanner (which recovers both email addresses and urls). The -o option is followed by the name of the directory you want to send the results to (which must not already exist - bulk_extractor will create it).
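
As an aside - and do check the help output on your own build, as I am working from memory here - the -e switch enables an extra scanner on top of whatever is already enabled, and -x disables one. So to add the json scanner (handy for those chat fragments) to the email run above, something like this should work:

bulk_extractor -E email -e json -o bulk bad_guy.dd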

I urge you all to download and experiment with it; there is a Windows gui that will process E01 image files and directory structures. There are few cases that I can think of where I don't run bulk_extractor, and there are a number of occasions where I have recovered crucial urls or json chat fragments missed by other tools.


