@shaobaobaoer
2018-02-12T08:20:33.000000Z
Tags: tarot, crawler
This is a follow-up to my previous article, "Tarot Reader Data Crawling".
In this article I will show how to fetch translations from Google Translate and update my database with them. To be honest, I consulted many references before choosing the requests library and execjs as the engine of my crawler.
The tk argument is a numeric string that Google's algorithm derives from the text to be translated. It is sent as one of the parameters of the GET request.
Many thanks to cocoa520 for the script that calculates the tk argument for Google Translate.
Here is the link, which makes crawling the translated text much more convenient:
LINK
To call it from Python, I recommend execjs. You can install it with pip install PyExecJS (the module is imported as execjs).
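    import execjs

    class Py4Js():
        def __init__(self):
            # Compile cocoa520's JavaScript once and reuse the context
            self.ctx = execjs.compile("""#copy JS here""")

        def getTk(self, text):
            # Call the JS function TL to compute tk for the given text
            return self.ctx.call("TL", text)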
Besides the tk argument, the other arguments seem simple. Use Burp Suite to capture the request and replay it in Python.
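As a minimal sketch of how the captured request can be rebuilt, the query string can be assembled from a list of pairs rather than one long hard-coded URL. The parameter names and values are copied from the request used in this article. The helper name `build_translate_url` and the placeholder tk value are assumptions for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://translate.google.cn/translate_a/single"

# Fixed parameters taken from the captured request. "dt" repeats, so a list
# of pairs is used instead of a dict.
FIXED_PARAMS = [
    ("client", "t"), ("sl", "en"), ("tl", "zh-CN"), ("hl", "zh-CN"),
    ("dt", "at"), ("dt", "bd"), ("dt", "ex"), ("dt", "ld"), ("dt", "md"),
    ("dt", "qca"), ("dt", "rw"), ("dt", "rm"), ("dt", "ss"), ("dt", "t"),
    ("ie", "UTF-8"), ("oe", "UTF-8"), ("clearbtn", "1"), ("otf", "1"),
    ("pc", "1"), ("srcrom", "0"), ("ssel", "0"), ("tsel", "0"), ("kc", "2"),
]

def build_translate_url(tk, content):
    """Hypothetical helper: append tk and q to the fixed parameters."""
    params = FIXED_PARAMS + [("tk", tk), ("q", content)]
    return BASE_URL + "?" + urlencode(params)

if __name__ == "__main__":
    # "123456.654321" is a made-up placeholder, not a real tk value.
    url = build_translate_url("123456.654321", "To be or not to be")
    print(url)
    # Show that every repeated "dt" value survived the encoding.
    print(parse_qs(urlsplit(url).query)["dt"])
```

Keeping the parameters in a list makes it easy to change sl and tl when another language pair is needed.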
    import requests
    from HandleJs import Py4Js  # HandleJs is the module that wraps cocoa520's script

    def translate(tk, content):
        if len(content) > 4891:
            print("[-] too long for translation")
            return
        param = {'tk': tk, 'q': content}
        result = requests.get("""http://translate.google.cn/translate_a/single?client=t&sl=en&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1&srcrom=0&ssel=0&tsel=0&kc=2""", params=param)
        # Get the JSON list
        for text in result.json():
            print(text)

    def main():
        js = Py4Js()
        content = " To be or not to be ,that is a question."
        tk = js.getTk(content)
        translate(tk, content)

    if __name__ == "__main__":
        main()
However, the raw result is too noisy to save in the database.
To clean it up and pick out the Chinese sentences, you need another function.
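    # Inside translate(), in place of the print loop:
    RE = result.json()[0]
    result = ""
    for i in RE:
        # Keep only the entries whose first element is a string
        if isinstance(i[0], str):
            result += i[0].split()[0].strip("\n")
    print(type(result))
    return result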
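The filtering step can also be tried on its own, without calling the service. The sketch below is a standalone variant under assumptions: the function name `extract_translation` is hypothetical, the sample payload is hand-written to resemble the shape described here (a list of segments in position 0, each starting with the translated text), and unlike the snippet above it keeps each whole segment rather than only the part before the first whitespace.

```python
def extract_translation(payload):
    """Join the translated segments found in payload[0].

    Assumes each entry of payload[0] is a list whose first element is the
    translated text, and that entries without a string in that position
    should be skipped.
    """
    parts = []
    for segment in payload[0]:
        if segment and isinstance(segment[0], str):
            parts.append(segment[0].strip("\n"))
    return "".join(parts)

if __name__ == "__main__":
    # Hand-written sample, not real output from the service.
    sample = [
        [
            ["生存还是毁灭,", "To be or not to be,", None, None, 3],
            ["这是一个问题。", "that is a question.", None, None, 3],
            [None, None, "Shengcun haishi huimie"],
        ],
        None,
        "en",
    ]
    print(extract_translation(sample))
```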
OK! Here is the last step.
Combine these functions and save the result in the database, making sure that the charset is UTF-8.
    import pymysql

    # big_arcana_dict and T_tool are defined elsewhere in the full script;
    # T_tool(text) is presumably the helper that returns the translated text.

    def translate_main_function():
        for table_name in big_arcana_dict.values():
            for id in "123456":
                database_translate_content(table_name, id)

    def database_insert_translate(tables_name, id, text):
        """I write the result in file, to ensure the speed."""
        # Note: a single quote inside text would break this statement
        sql = "UPDATE %s SET cn_data='%s' WHERE id=%s;" % (tables_name, text, id)
        with open("sql.txt", "a", encoding="utf-8") as f:
            f.write(sql + "\n" + "COMMIT;" + "\n")

    def database_translate_content(table, id):
        db = pymysql.connect(host="localhost", user="root", password="",
                             database="tarot", charset="utf8mb4")
        cursor = db.cursor()
        sql = "SELECT * from %s WHERE id=%s LIMIT 0,1" % (table, id)
        # print(sql)
        try:
            cursor.execute(sql)
            results = cursor.fetchall()
            for row in results:
                text = row[1]  # second column holds the English text
                text = T_tool(text)
                # print(text)
                database_insert_translate(table, id, text)
        except Exception:
            print("Error: unable to fetch data")
        finally:
            db.close()
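Because the translated text is pasted straight into the SQL string, a single quote in a translation would break the generated statement. One possible way around this, sketched here under assumptions, is to let the driver bind the values. The example uses the standard-library sqlite3 module with an in-memory database as a stand-in so that it runs anywhere; with pymysql the placeholder would be %s instead of ?, passed through cursor.execute(sql, args). The table name the_fool, the ALLOWED_TABLES set and the helper `save_translation` are hypothetical.

```python
import sqlite3

# Hypothetical whitelist: table names cannot be bound as parameters,
# so they are checked against known names instead.
ALLOWED_TABLES = {"the_fool"}

def save_translation(conn, table, row_id, text):
    """Store text in the cn_data column of one row, using bound values."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: %r" % table)
    sql = "UPDATE %s SET cn_data = ? WHERE id = ?" % table
    conn.execute(sql, (text, row_id))
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE the_fool (id INTEGER PRIMARY KEY, data TEXT, cn_data TEXT)"
    )
    conn.execute("INSERT INTO the_fool (id, data) VALUES (1, 'The Fool')")
    # The quote inside the text is handled by the driver, not by hand.
    save_translation(conn, "the_fool", 1, "愚者's 旅程")
    print(conn.execute("SELECT cn_data FROM the_fool WHERE id = 1").fetchone()[0])
    conn.close()
```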
Before running the code, update the tarot database and create a new column to hold the translated version.
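A rough sketch of that preparation step follows. It again uses sqlite3 as a stand-in so it can run without a MySQL server, and the PRAGMA lookup is specific to SQLite; on MySQL the equivalent would be an ALTER TABLE ... ADD COLUMN statement run once per table. The column name cn_data comes from the UPDATE statement in this article, while the table name the_fool and the helper `add_translation_column` are hypothetical.

```python
import sqlite3

def add_translation_column(conn, table, column="cn_data"):
    """Add a text column for the translation; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = [row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)]
    if column in existing:
        return False
    conn.execute("ALTER TABLE %s ADD COLUMN %s TEXT" % (table, column))
    conn.commit()
    return True

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE the_fool (id INTEGER PRIMARY KEY, data TEXT)")
    print(add_translation_column(conn, "the_fool"))  # column is created
    print(add_translation_column(conn, "the_fool"))  # already present
    conn.close()
```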
You can view or copy the complete code on my GitHub.
Github href
For learning purposes only. Not for commercial use.