@shaobaobaoer 2018-02-12T08:20:33.000000Z

Python Crawler: Tarot Reader Google Translation

tarot, crawler

0x01 Abstract

This is the follow-up to my previous article, "Tarot Reader Data Crawling".

In this article I show how to fetch translations from Google Translate and use them to update my database. To be honest, I consulted many references before choosing the requests library and execJS as the engine of my crawler.

0x02 Handling the TK Argument

The TK argument is a token that Google's algorithm derives from the text to translate. It is a string of digits that is sent as a parameter of the GET request.

Many thanks to cocoa520 for the script that calculates the TK argument for Google Translate.
Here is the link; it makes crawling the translated text much more convenient:
LINK

To call it from Python, I recommend execJS. You can install it with pip install PyExecJS (the module is imported as execjs).

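    import execjs

    class Py4Js:
        def __init__(self):
            # Compile cocoa520's JavaScript once; it defines the TL function.
            self.ctx = execjs.compile("""
            // copy JS here
            """)

        def getTk(self, text):
            # Call the JS function TL with the text to translate and return the tk value.
            return self.ctx.call("TL", text)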

0x03 Start The Translation

Apart from the tk argument, the other arguments are simple. Use Burp Suite to capture the request and replay it from Python.

    import requests
    from HandleJs import Py4Js      # HandleJs.py holds the Py4Js wrapper around cocoa520's script

    # Query string copied from the captured request; tk and q are added per call.
    # Adjacent string literals are joined, so the URL contains no newlines or spaces.
    URL = ("http://translate.google.cn/translate_a/single?client=t&sl=en"
           "&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss"
           "&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1&srcrom=0&ssel=0&tsel=0&kc=2")

    def translate(tk, content):
        if len(content) > 4891:
            print("[-] too long for translation")
            return
        param = {'tk': tk, 'q': content}
        result = requests.get(URL, params=param)
        # The response body is a JSON list
        for text in result.json():
            print(text)

    def main():
        js = Py4Js()
        content = "To be or not to be, that is the question."
        tk = js.getTk(content)
        translate(tk, content)

    if __name__ == "__main__":
        main()
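
The long query string above repeats the dt key several times, which is easy to mistype. Below is a minimal sketch, using only the standard library, of assembling the same URL from a parameter list. The names BASE, FIXED and build_url are made up for this example, and the tk value in the usage line is a dummy placeholder, not a real token.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the request shown above.
BASE = "http://translate.google.cn/translate_a/single"

# Fixed parameters copied from the captured request. "dt" occurs many times,
# so it is kept as a list and expanded by urlencode(..., doseq=True).
FIXED = {
    "client": "t", "sl": "en", "tl": "zh-CN", "hl": "zh-CN",
    "dt": ["at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t"],
    "ie": "UTF-8", "oe": "UTF-8", "clearbtn": "1", "otf": "1", "pc": "1",
    "srcrom": "0", "ssel": "0", "tsel": "0", "kc": "2",
}

def build_url(tk, content):
    """Return the full request URL for one piece of text (hypothetical helper)."""
    params = dict(FIXED, tk=tk, q=content)
    return BASE + "?" + urlencode(params, doseq=True)

# Dummy tk value, for illustration only.
url = build_url("123456.654321", "To be or not to be")
print(url)

# Parsing the URL back shows that every dt value survived.
print(parse_qs(urlsplit(url).query)["dt"])
```

Keeping the repeated key as a list means a value can be added or removed without editing a long string by hand.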

0x04 Handling the Result

However, the raw result is not clean enough to save in the database.
To process it and pick out the Chinese sentences, you need a few more lines.

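    # These lines replace the print loop at the end of translate().
    segments = result.json()[0]          # first element: list of sentence segments
    translated = ""
    for seg in segments:
        # seg[0] holds the translated text; skip entries where it is not a string
        if isinstance(seg[0], str):
            translated += seg[0].strip()
    return translated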
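
To make the loop above concrete, here is a small sketch that runs on a hand-written list shaped the way the loop assumes: each segment is a list whose first item is either the translated string or None. The sample data and the name join_segments are illustrative assumptions, not real output from Google.

```python
def join_segments(segments):
    """Concatenate the translated text of every segment that carries a string."""
    parts = []
    for seg in segments:
        # Skip empty segments and those whose first item is not text.
        if seg and isinstance(seg[0], str):
            parts.append(seg[0].strip())
    return "".join(parts)

# Hand-written stand-in for result.json()[0]; not a real response.
sample = [
    ["生存还是毁灭,", "To be or not to be,", None, None, 3],
    ["这是一个问题。\n", "that is a question.", None, None, 3],
    [None, None, "placeholder for a non-text entry"],
]

print(join_segments(sample))
```

Collecting the pieces in a list and joining once keeps the function easy to test without a network call.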

0x05 Save Them in the Database!

OK! Here is the last step.
Combine these functions and save the result in the database. Make sure the charset is UTF-8.

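    import pymysql

    # big_arcana_dict (its values are table names) and T_tool (text in,
    # translation out) are defined elsewhere in the project.

    def translate_main_function():
        for table_name in big_arcana_dict.values():
            for row_id in range(1, 7):
                database_translate_content(table_name, row_id)

    def database_insert_translate(table_name, row_id, text):
        """
        Write the UPDATE statement to a file instead of running it right away, for speed.
        """
        # Escape quotes so a translation containing ' does not break the statement.
        text = pymysql.converters.escape_string(text)
        sql = "UPDATE %s SET cn_data='%s' WHERE id=%s;" % (table_name, text, row_id)
        with open("sql.txt", "a", encoding="utf-8") as f:
            f.write(sql + "\n" + "COMMIT;" + "\n")

    def database_translate_content(table, row_id):
        db = pymysql.connect(host="localhost", user="root", password="",
                             database="tarot", charset="utf8mb4")
        try:
            cursor = db.cursor()
            # Table names cannot be bound as parameters; the id can.
            sql = "SELECT * FROM %s WHERE id=%%s LIMIT 0,1" % table
            cursor.execute(sql, (row_id,))
            for row in cursor.fetchall():
                text = T_tool(row[1])      # row[1] is the text to translate
                database_insert_translate(table, row_id, text)
        except pymysql.MySQLError:
            print("Error: unable to fetch data")
        finally:
            db.close()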

Before running the code, update the tarot database and add a new column (cn_data) to hold the translated text.
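
Adding the column and filling it can be sketched as follows. To keep the example runnable without a MySQL server, it uses the standard library's sqlite3 as a stand-in, so this is not the pymysql setup used above. The table name the_fool, the column en_data and the sample strings are made up for illustration; cn_data is the column the UPDATE statement above writes to.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL "tarot" database.
db = sqlite3.connect(":memory:")
cur = db.cursor()

# Hypothetical table holding one English text per row.
cur.execute("CREATE TABLE the_fool (id INTEGER PRIMARY KEY, en_data TEXT)")
cur.execute("INSERT INTO the_fool (id, en_data) VALUES (1, 'A new beginning')")

# Step 1: add the column that will hold the translation.
cur.execute("ALTER TABLE the_fool ADD COLUMN cn_data TEXT")

# Step 2: store a translation. The value is bound as a parameter, so quotes
# inside the text cannot break the statement.
translated = "新的开始 'quoted'"
cur.execute("UPDATE the_fool SET cn_data = ? WHERE id = ?", (translated, 1))
db.commit()

row = cur.execute(
    "SELECT id, en_data, cn_data FROM the_fool WHERE id = 1"
).fetchone()
print(row)
db.close()
```

With pymysql the placeholder is %s rather than ?, but the idea of binding the value instead of formatting it into the string is the same.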

0x06 Download the Latest Code

You can view or copy the complete code on my GitHub.
Github href

For learning only. Not for commercial use.