@shaobaobaoer
2018-02-10T03:45:42.000000Z
Tags: tarot, crawler
Why not write a Python crawler to practice what I have learned these days?
That is what I said to myself last night.
I should write a Python crawler with the following functions:
OK, let's do it!
So, which website should we choose?
When you search for "tarot" on Google, what do you find? The first result is always free-tarot-reading.net, which, I think, attracts many well-known tarot scholars.
Its free online tarot reading looks very appealing. OK, click the first entry:
Universal 6 Card Spread by LT ZEN Mode /switch Animations
Did you spot the secret in the URL?
https://www.free-tarot-reading.net/readings/135433044
Look at the number at the end of the URL: it is a path leak (maybe that is not a real term).
How about using another number at the end of the URL?
The reading number is random, but it still follows some rules. I'm not the admin of that website, but we can strengthen our crawler so that it picks out only the useful information.
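Because the reading ID sits at the end of the path, trying other numbers only means building URLs from a range of IDs. Here is a minimal sketch of that idea. `reading_urls` is a hypothetical helper, not part of the original crawler, and it simply assumes that IDs close to a known reading are worth probing:

```python
BASE_URL = "https://www.free-tarot-reading.net/readings/%d"


def reading_urls(start, count):
    """Yield URLs for `count` consecutive reading IDs beginning at `start`."""
    for offset in range(count):
        yield BASE_URL % (start + offset)


# Print the three candidate URLs that follow the reading shown above.
for url in reading_urls(135433044, 3):
    print(url)
```

Many of these IDs will not point at the spread we want, which is why the crawler has to filter what it fetches.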
OK, our target is clear:
the Universal 6 Card Spread
- It uses the 22 Major Arcana cards (numbered 0 to 21)
- Each card has its own meaning in each of the 6 positions
So we create a database named tarot with 22 tables, one per card, each holding the card position and its meaning.
big_arcana_dict is a dictionary whose keys are card numbers and whose values are card names. You can create the tables like this:
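import pymysql

def database_create():
    # Create one table per Major Arcana card; big_arcana_dict maps 0..21 to card names.
    db = pymysql.connect(host="localhost", user="root", password="", database="tarot")
    cursor = db.cursor()
    for i in range(0, 22):
        # Backticks keep the statement valid even if a card name contains spaces.
        string = "CREATE TABLE `%s` ( id int(5) NOT NULL , text varchar(1000))" % big_arcana_dict[i]
        cursor.execute(string)
    cursor.close()
    db.close()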
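The table-creation code above relies on big_arcana_dict, which is not shown in this article. The sketch below is one guess at what it could look like. The card names and their numbering follow the common Rider-Waite order and are an assumption, and `create_table_sql` is a hypothetical helper. Names that contain spaces must be wrapped in backticks when they are used as MySQL table names:

```python
# Assumed contents: the real big_arcana_dict lives in the author's repository.
big_arcana_names = [
    "The Fool", "The Magician", "The High Priestess", "The Empress",
    "The Emperor", "The Hierophant", "The Lovers", "The Chariot",
    "Strength", "The Hermit", "Wheel of Fortune", "Justice",
    "The Hanged Man", "Death", "Temperance", "The Devil",
    "The Tower", "The Star", "The Moon", "The Sun",
    "Judgement", "The World",
]

# Keys are card numbers 0..21, values are card names.
big_arcana_dict = dict(enumerate(big_arcana_names))


def create_table_sql(card_name):
    """Build the CREATE TABLE statement for one card (hypothetical helper)."""
    return "CREATE TABLE `%s` (id INT(5) NOT NULL, text VARCHAR(1000))" % card_name


print(len(big_arcana_dict))
print(create_table_sql(big_arcana_dict[0]))
```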
As a fan of webdriver, I chose it as the engine of my crawler.
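import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def wdriver(a):
    # xpath_list (one XPath per card position) and errorhander are defined elsewhere in the project.
    x = webdriver.Firefox()
    startnum = 135432416
    url = "https://www.free-tarot-reading.net/readings/%s" % (startnum + a)
    for i in range(0, 6):
        try:
            x.get(url)
            print("[+] Get payload %s" % (startnum + a), time.ctime())
            # find_element_by_xpath was removed in recent Selenium 4 releases.
            content_1 = x.find_element(By.XPATH, xpath_list[i]).text
            print(content_1)
        except Exception:
            errorhander("wdriver error")
        # Wait between requests so we do not hammer the site.
        time.sleep(5)
    x.quit()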
To make sure the information is usable, we should check it.
Here is what we need to do:
- Escape ' as \' and keep line breaks as \n, which makes the SQL operations easier
- Check the head of the content and reject it if it is not a card reading for the Universal 6 Card Spread
- Check the card name again: if the card is not in the Major Arcana, the information is useless
# stringmaker(content_1)
def stringmaker(string):
    string = string.split("\n")
    id = 0
    card = ""
    text = ""
    # The first line should look like "Card N", with the position digit at index 5.
    if len(string) > 1 and len(string[0]) > 5 and "Card" in string[0] and 48 < ord(string[0][5]) < 57:
        id = int(string[0][5])
    else:
        # this error hander still has some problem
        errorhander("no id")
        return
    # The second line must name one of the Major Arcana cards.
    for i in range(0, 22):
        if big_arcana_dict[i] in string[1]:
            card = big_arcana_dict[i]
            break
    if card == "":
        errorhander("no card")
        return
    for i in range(2, len(string)):
        # retain \' and \n make it convenient for sql operation
        text += string[i].replace("'", "\\'") + '\\n'
    if text == "":
        errorhander("no text")
        return
    database_write(id, card, text)
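The checks above can also be written as a pure function that only parses and returns a result, which makes them easy to try without a browser or a database. This is a sketch under assumptions, not the author's code. It assumes the first line starts with "Card N" and the second line contains the card name, as the indexing in stringmaker suggests. `parse_reading`, the shortened card list, and the sample text are all hypothetical:

```python
import re

# Shortened, assumed list; the real one would hold all Major Arcana names.
MAJOR_ARCANA = ("The Fool", "The Magician")


def parse_reading(content, card_names=MAJOR_ARCANA):
    """Return (position, card, text), or None if the block is not usable."""
    lines = content.split("\n")
    if len(lines) < 3:
        return None
    # Head check: the block must start with "Card" and a position from 1 to 6.
    match = re.match(r"Card (\d)", lines[0])
    if match is None or not 1 <= int(match.group(1)) <= 6:
        return None
    # Card check: the second line must name a known Major Arcana card.
    card = next((name for name in card_names if name in lines[1]), None)
    if card is None:
        return None
    text = "\n".join(lines[2:]).strip()
    if not text:
        return None
    return int(match.group(1)), card, text


sample = "Card 1: Past\nThe Fool\nA new start.\nIt's fine."
print(parse_reading(sample))
```

Returning None instead of calling an error handler leaves the decision about what to do with a bad page to the caller.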
We should also check whether the information is a duplicate.
We use pymysql to look it up.
def database_write(id, tables_name, text):
    # Only insert when this position has not been stored for this card yet.
    if database_avoid_cycle(tables_name, id):
        db = pymysql.connect(host="localhost", user="root", password="", database="tarot")
        cursor = db.cursor()
        # text was already escaped by stringmaker, so it can sit between the quotes.
        sql = "INSERT INTO `%s` VALUES (%s, '%s')" % (tables_name, id, text)
        try:
            cursor.execute(sql)
            db.commit()
            print("[+] Success to upload data with ", id, tables_name, time.ctime())
        except Exception:
            db.rollback()
            print("Error")
        db.close()
def database_avoid_cycle(table, id):
    # Return True when the card's table has no non-empty row with this id yet.
    db = pymysql.connect(host="localhost", user="root", password="", database="tarot")
    cursor = db.cursor()
    sql = "SELECT * from `%s` WHERE id=%s LIMIT 0,1" % (table, id)
    try:
        cursor.execute(sql)
        results = cursor.fetchall()
        if not results:
            db.close()
            return True
        for row in results:
            text = row[1]
            if text != "":
                db.close()
                print("[-] Fail to upload data for the data has been existed")
                return False
    except Exception:
        print("[-] error ")
    db.close()
    return True
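The functions above build their SQL with string formatting and depend on the text having been escaped by hand. A safer variant, sketched here as an alternative rather than as the author's method, passes the values as parameters and lets the driver do the quoting. pymysql's `cursor.execute(query, args)` takes `%s` placeholders for values, but a table name cannot be a parameter, so it is checked against a known list instead. `build_insert` and `ALLOWED_TABLES` are hypothetical names:

```python
# Assumed whitelist; in practice it would hold all 22 card names.
ALLOWED_TABLES = {"The Fool", "The Magician"}


def build_insert(table, card_position, text):
    """Return (sql, params) for a parameterized INSERT into a card table."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unknown table: %r" % table)
    # %%s survives the formatting as a literal %s placeholder for the driver.
    sql = "INSERT INTO `%s` VALUES (%%s, %%s)" % table
    return sql, (card_position, text)


sql, params = build_insert("The Fool", 1, "It's a new start.\nGo.")
print(sql)
print(params)
# With an open pymysql cursor, this would run as: cursor.execute(sql, params)
```

With this approach the manual replacement of ' and line breaks in stringmaker would no longer be needed, since the raw text is handed to the driver unchanged.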
Many other things are still left for me to finish.
The tarot reading function will be added within a week if everything goes smoothly.
I will not cover the other operations in this article.
The whole code is on my GitHub, where you can view or copy it.
Just for learning, not for commercial use.