@shaobaobaoer
2018-02-12T08:20:33.000000Z
tarot
crawler
This is the advanced part of my previous article, "Tarot Reader Data Crawling".
In this article I will show how to crawl translations from Google Translate and use them to update my database. To be honest, I consulted many references before choosing the requests library and ExecJS as the engine of my crawler.
The tk argument is produced by a Google algorithm that turns the text to be translated into a complex string of numbers, which is sent along in the GET request.
Many thanks to cocoa520 for the script that calculates the tk argument for Google Translate.
Here is the link, which makes crawling the translated text much more convenient:
LINK
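To call it from Python code, I recommend ExecJS. You can get it with pip install PyExecJS (the module is imported as execjs, and it needs a JavaScript runtime such as Node.js on the machine).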
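import execjs


class Py4Js():
    def __init__(self):
        # Compile cocoa520's JavaScript once so it can be called repeatedly.
        self.ctx = execjs.compile("""
        // copy JS here
        """)

    def getTk(self, text):
        # Call the TL function of the pasted script to get the tk value for text.
        return self.ctx.call("TL", text)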
Besides the tk argument, the other arguments are simple. Use Burp Suite to capture the request, then replay it in Python.
import requests
from HandleJs import Py4Js  # HandleJs.py holds the Py4Js wrapper around cocoa520's script


def translate(tk, content):
    if len(content) > 4891:
        print("[-] too long for translation")
        return
    param = {'tk': tk, 'q': content}
    # Adjacent string literals are joined, so the URL contains no line breaks.
    url = ("http://translate.google.cn/translate_a/single?client=t&sl=en"
           "&tl=zh-CN&hl=zh-CN&dt=at&dt=bd&dt=ex&dt=ld&dt=md&dt=qca&dt=rw&dt=rm&dt=ss"
           "&dt=t&ie=UTF-8&oe=UTF-8&clearbtn=1&otf=1&pc=1&srcrom=0&ssel=0&tsel=0&kc=2")
    result = requests.get(url, params=param)
    # Get the JSON list
    for text in result.json():
        print(text)


def main():
    js = Py4Js()
    content = "To be or not to be, that is a question."
    tk = js.getTk(content)
    translate(tk, content)


if __name__ == "__main__":
    main()
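The query string of the captured request can also be assembled with the standard library, which avoids hand-editing a long URL. The following is only a sketch: the function name build_translate_url and the tk value "123456.654321" are placeholders invented for illustration, and a real tk still has to come from getTk.

```python
from urllib.parse import urlencode

BASE_URL = "http://translate.google.cn/translate_a/single"


def build_translate_url(tk, content, source="en", target="zh-CN"):
    """Return the GET URL; a list of pairs keeps the repeated dt keys in order."""
    params = [("client", "t"), ("sl", source), ("tl", target), ("hl", target)]
    # The captured request repeats dt ten times, once per value below.
    params += [("dt", v) for v in
               ("at", "bd", "ex", "ld", "md", "qca", "rw", "rm", "ss", "t")]
    params += [("ie", "UTF-8"), ("oe", "UTF-8"), ("clearbtn", "1"), ("otf", "1"),
               ("pc", "1"), ("srcrom", "0"), ("ssel", "0"), ("tsel", "0"),
               ("kc", "2"), ("tk", tk), ("q", content)]
    # urlencode percent-encodes the values and turns spaces into "+".
    return BASE_URL + "?" + urlencode(params)


# Placeholder tk, for illustration only.
url = build_translate_url("123456.654321", "To be or not to be")
print(url)
```

The returned string can be passed directly to requests.get(url) without a separate params argument.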
However, the raw result is not clean enough to save in the database.
To process it and pick out the Chinese sentences, you need another function.
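    # Meant for the end of translate(), in place of the loop that prints the raw JSON.
    RE = result.json()[0]
    result = ""
    for i in RE:
        # Keep only entries whose first item is a string (the translated text).
        if isinstance(i[0], str):
            result += i[0].split()[0].strip("\n")
    print(type(result))  # debug output
    return result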
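To try the extraction step without calling Google, it can be written as a standalone function and run on a hand-made list. This is a minimal sketch under assumptions: the name extract_translation is hypothetical, the sample below is written by hand in the shape the code above expects (it is not a captured response), and strip() is used instead of split()[0] so that a segment containing spaces is kept whole.

```python
def extract_translation(payload):
    """Join the translated segments found in payload[0]."""
    pieces = []
    for segment in payload[0]:
        # Skip empty entries and entries whose first item is not text.
        if segment and isinstance(segment[0], str):
            pieces.append(segment[0].strip())
    return "".join(pieces)


# Hand-written sample in the assumed shape, not real output from the service.
sample = [
    [
        ["生存还是毁灭，", "To be or not to be,", None, None, 1],
        ["这是一个问题。", "that is a question.", None, None, 1],
        [None, None, "pinyin placeholder", None],
    ],
    None,
    "en",
]

translated = extract_translation(sample)
print(translated)
```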
OK! Here is the last step.
Combine these functions and save the result in the database; make sure that the charset is UTF-8.
import pymysql

# big_arcana_dict (its values are the table names) and T_tool (the translation
# helper) are defined elsewhere in the project.


def translate_main_function():
    for table_name in big_arcana_dict.values():
        for id in "123456":
            database_translate_content(table_name, id)


def database_insert_translate(tables_name, id, text):
    """
    I write the result to a file instead of the database, for speed.
    """
    # Note: a single quote inside text will break this statement.
    sql = "UPDATE %s SET cn_data='%s' WHERE id=%s;" % (tables_name, text, id)
    with open("sql.txt", "a", encoding="utf-8") as f:
        f.write(sql + "\n" + "COMMIT;" + "\n")


def database_translate_content(table, id):
    db = pymysql.connect(host="localhost", user="root", password="",
                         database="tarot", charset="utf8mb4")
    cursor = db.cursor()
    sql = "SELECT * from %s WHERE id=%s LIMIT 0,1" % (table, id)
    # print(sql)
    try:
        cursor.execute(sql)
        results = cursor.fetchall()
        for row in results:
            text = row[1]
            text = T_tool(text)
            # print(text)
            database_insert_translate(table, id, text)
    except Exception as exc:
        print("Error: unable to fetch data", exc)
    finally:
        db.close()
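Because the UPDATE statement above is built with string formatting, a translation that contains a single quote would produce broken SQL. One possible alternative is to pass the values as query parameters. The sketch below uses the standard library's sqlite3 with an in-memory database only so that it runs anywhere; the table name the_fool, the column en_data, and the function save_translation are hypothetical. With PyMySQL the same idea applies, but the placeholder is %s instead of ?.

```python
import sqlite3

ALLOWED_TABLES = {"the_fool"}  # hypothetical table name, for illustration


def save_translation(conn, table, row_id, text):
    """Store text in the cn_data column of one row."""
    # Table names cannot be bound as parameters, so check them against a known set.
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table: " + table)
    # The values are bound by the driver, so quotes in text are harmless.
    conn.execute("UPDATE " + table + " SET cn_data = ? WHERE id = ?", (text, row_id))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_fool (id INTEGER PRIMARY KEY, en_data TEXT, cn_data TEXT)")
conn.execute("INSERT INTO the_fool (id, en_data) VALUES (1, 'New beginnings')")

save_translation(conn, "the_fool", 1, "愚者的'旅程'")
stored = conn.execute("SELECT cn_data FROM the_fool WHERE id = 1").fetchone()[0]
print(stored)
```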
Before running the code, update the tarot database and create a new column to hold the translated version.
You can view or copy the complete code on my GitHub.
Github href
For learning only, not for commercial use.