Compare commits

16 Commits

Author SHA1 Message Date
520303cd03 Merge branch 'master' of https://git.saret.tk/saret/BulkBooks 2023-05-02 22:33:40 +03:00
14bd12ee8f books to download 2023-05-02 22:33:31 +03:00
Benny Saret
6e0446af3f Merge pull request 'master' (#3) from Mooooooooo/BulkBooks:master into master
Reviewed-on: https://git.saret.tk/saret/BulkBooks/pulls/3
2023-05-02 14:47:44 +03:00
saret
18f4876291 Merge branch 'master' of https://git.saret.tk/Mooooooooo/BulkBooks 2023-05-02 14:46:22 +03:00
8163dd903d Modify The Readme
Added explanations of how to use the script.
2023-05-01 23:08:46 +03:00
6d19d28f6e prettified the code 2023-05-01 22:56:39 +03:00
work
d73c5c579f Merge branch 'master' of https://git.saret.tk/Mooooooooo/BulkBooks 2023-04-25 12:54:33 +03:00
work
4eefe79bab added requirements 2023-04-25 12:52:57 +03:00
864ab7ad1e Merge pull request 'master' (#1) from saret/BulkBooks:master into master
Reviewed-on: https://git.saret.tk/Mooooooooo/BulkBooks/pulls/1
2023-04-25 12:44:39 +03:00
3055dc05f9 typo 2023-04-25 12:28:23 +03:00
29f6324bc2 Merge pull request 'master' (#2) from Mooooooooo/BulkBooks:master into master
Reviewed-on: https://git.saret.tk/saret/BulkBooks/pulls/2
2023-04-25 12:22:46 +03:00
work
7fc8519e4a path location 2023-04-25 12:20:06 +03:00
work
8cbdcee256 headless 2023-04-25 12:17:03 +03:00
d9dfaaa68c preformence improvement 2023-04-25 12:11:12 +03:00
955a7ed6f5 Merge pull request 'update bookbulk' (#1) from Mooooooooo/BulkBooks:master into master
Reviewed-on: https://git.saret.tk/saret/BulkBooks/pulls/1
2023-04-25 11:33:45 +03:00
116858bf48 update bookbulk 2023-04-25 11:32:01 +03:00
7 changed files with 276 additions and 1 deletions

3
.gitignore vendored Normal file

@@ -0,0 +1,3 @@
.vscode/
venv/
ignore/

33
BooksToDownload Normal file

@@ -0,0 +1,33 @@
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110994706
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110918661
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110930833
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110933810
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110938002
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110938002
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110942621
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110942621
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110948032
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110948032
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=110959329
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=105256337
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=105018830
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=109444642
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=109400325
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=109392561
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=107884166
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=97645077
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=97645077
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=102594097
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=102591827
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=102588217
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=102589202
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=101052334
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=101048613
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=101986400
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=100976710
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=100974786
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=108426718
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=108236946
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=106246523
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=104502712
https://kotar.cet.ac.il/KotarApp/Viewer.aspx?nBookID=103818662

View File

@@ -1,2 +1,12 @@
# BulkBooks
# Bulk Books:
This script helps you download books from [Kotar](https://kotar.cet.ac.il/).
## How To?
1. You need academic access to Kotar.
1. You need Python >= 3.9.
1. Install the requirements (preferably inside a venv).
1. Add the links to the __BooksToDownload__ file.
1. Run the script.

Enjoy.
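
The link file from the steps above can be read roughly as sketched below. This is a minimal sketch, not the script's own code: `clean_links` and `load_links` are hypothetical helpers. They skip blank lines and lines starting with `#` (which the script also ignores) and, unlike the script, drop repeated links so the same book is not fetched twice.

```python
from pathlib import Path


def clean_links(lines):
    """Return the links in order, without blanks, '#' comments or repeats."""
    links = []
    seen = set()
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # nothing to download on this line
        if line not in seen:
            seen.add(line)
            links.append(line)
    return links


def load_links(path="BooksToDownload"):
    """Read the link file (one link per line) and clean it."""
    return clean_links(Path(path).read_text(encoding="utf_8").splitlines())
```

Calling `load_links()` from the repository root would return the cleaned list for the committed `BooksToDownload` file.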

191
download books in bulks.py Normal file

@@ -0,0 +1,191 @@
import selenium
from selenium.webdriver.common import action_chains
import urllib3
import bs4
import re
import os
import glob
from selenium.common import exceptions
from selenium import webdriver
import img2pdf
import threading
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.actions import interaction
from selenium.webdriver.common import keys
# from parser import ArgumentParser
ACTS = []
LAST_ACTS = []
SOURCES = []
Books = []
THREADS = []
PATHS = []
OLD_REMOVE = []
BROWSER_PREFENCES = {"browser.download.folderList": 2, "browser.download.manager.showWhenStarting": False, "browser.download.dir": "ignore", "browser.helperApps.neverAsk.saveToDisk": "attachment/csv, text/plain, application/octet-stream, application/binary, text/csv, application/csv, application/excel, text/comma-separated-values, text/xml, application/xml, application/xls, excel/xls, application/excel 97-2003,application/Microsoft Excel 97-2003 Worksheet, application/vnd.ms-excel", "browser.helperApps.neverAsk.openFile":
"application/PDF, application/FDF, application/XFDF, application/LSL, application/LSO, application/LSS, application/IQY, application/RQY, application/XLK, application/XLS, application/XLT, application/POT application/PPS, application/PPT, application/DOS, application/DOT, application/WKS, application/BAT, application/PS, application/EPS, application/WCH, application/WCM, application/WB1, application/WB3, application/RTF, application/DOC, application/MDB, application/MDE, application/WBK, application/WB1, application/WCH, application/WCM, application/AD, application/ADP, application/vnd.ms-excel", "browser.download.panel.shown": False}
def remove_text(text: str):
    '''remove_text removes the url from the text file
    :param text: url to remove
    :type text: str
    '''
    with open("BooksToDownload", "r", encoding="utf_8") as file:
        data_text = file.readlines()
    if text in data_text:
        data_text.pop(data_text.index(text))
        with open("BooksToDownload", 'w', encoding='utf_8') as file:
            file.writelines(data_text)
    elif not text.endswith("\n"):
        # readlines() keeps the trailing newline, so retry once with it appended;
        # the guard stops the recursion when the url is not in the file at all
        remove_text(f"{text}\n")
def set_folder_name(html: bs4.BeautifulSoup):
    name = html.find("title").text if html.text else " None"
    return name[name.find(" - ")+3:].replace('"', "''").replace("\\", '||').replace(r':', r'׃').replace(r"/", r"|").replace("\n", "").replace('?', ';;')


def down_to_list(url: str):
    data = urllib3.PoolManager().request("GET", url)
    return data.data if data.data else None


def _split_styler(style: str):
    begin = style.find('"')+1
    end = style.rfind('"')
    return style[begin:end]


def update_SOURCES(index: int):
    global SOURCES, ACTS
    keys_html = bs4.BeautifulSoup(
        ACTS[index]._driver.page_source, "html.parser").find_all(
        "div", attrs={"class": "BV_oImage"})
    dic_update = {key.attrs["id"]: _split_styler(key.attrs["style"])
                  for key in keys_html if "http" in key.attrs["style"]}
    SOURCES[index].update(dic_update)
def do_action_now(index: int):
    global SOURCES
    global ACTS
    ACTS[index].perform()
    name = set_folder_name(bs4.BeautifulSoup(ACTS[index]._driver.page_source, "html.parser"))
    update_SOURCES(index)
    if not os.path.exists("ignore/"+f"{name}"):
        os.mkdir("ignore/"+f"{name}")
    files = list(SOURCES[index])
    for s in SOURCES[index]:
        if SOURCES[index][s]:
            if not os.path.exists(f"ignore/{name}/{files.index(s):04}.jpg"):
                with open(f"ignore/{name}/{files.index(s):04}.jpg", "wb") as F:
                    F.write(urllib3.PoolManager().request("GET", SOURCES[index][s]).data)
                    # files[files.index(s)] = urllib3.PoolManager().request("GET", SOURCES[index][s]).data
    return files
# def check_and_act(index: int,last):
def get_first_empty(index: int):
    global SOURCES
    for s in SOURCES[index]:
        if not SOURCES[index][s]:
            return s
    return None
def act_now(index: int, path: str = None):
    global SOURCES
    global ACTS
    global couters
    global TREADS
    global OLD_REMOVE
    global LAST_ACTS
    global treads
    s = 0
    name = set_folder_name(bs4.BeautifulSoup(ACTS[index]._driver.page_source, "html.parser"))
    save_first = ""
    last = list(SOURCES[index].keys())[-1]
    while "" in SOURCES[index].values():
        if s == 0:
            LAST_ACTS[index].perform()
            s = 1
        save_first = get_first_empty(index)
        url_now = ACTS[index]._driver.current_url
        url_now = url_now[:url_now.find("#")+1] + save_first
        if SOURCES[index][last] and "" in SOURCES[index].values():
            ACTS[index]._driver.get(url_now)
            do_action_now(index)
            SOURCES[index][last] = ""
        else:
            do_action_now(index)
    if SOURCES[index] and "" not in SOURCES[index].values():
        couters += 1
        pathus = f'{path}/{name}.pdf' if path else f"ignore/{name}/{name}.pdf"
        with open(pathus, "wb") as file:
            file.write(img2pdf.convert(glob.glob(f"ignore/{name}/*.jpg")))
        ACTS[index]._driver.quit()
        remove_text(OLD_REMOVE[index])
        treads -= 1
def open_firefox(url: str):
    '''open_firefox opens the firefox browser on the specific url, and sets all the settings for the specific session
    :param url: url to run the firefox on
    :type url: str
    '''
    web = give_me_web()
    global SOURCES
    global ACTS
    global LAST_ACTS
    if not url.startswith("#") and url:
        book = webdriver.Firefox(web[0], executable_path=web[1], options=web[2])
        url = url if url.endswith("#1.undefined.8.none") else f'{url}#1.undefined.8.none'
        book.get(url)
        act = action_chains.ActionChains(book)
        lst_act = action_chains.ActionChains(book)
        lst_act._actions = [lst_act.key_down(keys.Keys.END), lst_act.pause(3), lst_act.key_up(keys.Keys.END)]
        act._actions = [act.send_keys(keys.Keys.PAGE_DOWN)]
        LAST_ACTS.append(lst_act)
        ACTS.append(act)
        SOURCES.append({})
def give_me_web():
    options = webdriver.FirefoxOptions()
    fp = webdriver.FirefoxProfile()
    # iterate over (key, value) pairs; iterating the dict itself yields only keys
    for key, val in BROWSER_PREFENCES.items():
        fp.set_preference(key, val)
    options.add_argument('--lang=EN')
    options.headless = True
    fire = "geckodriver"
    return (fp, fire, options)
if __name__ == "__main__":
    with open("BooksToDownload", "r", encoding="utf_8") as file:
        books = file.read().split("\n")
    for b in books:
        if b.find('````') > -1 and not b.startswith("#"):
            OLD_REMOVE.append(b)
            PATHS.append(b[b.rfind('`')+1:])
            b = b[:b.find('`')]
        elif not b.startswith("#"):
            OLD_REMOVE.append(b)
            PATHS.append(None)
        t1 = threading.Thread(None, open_firefox, args=(b,))
        t1.start()
        THREADS.append(t1)
    for t in THREADS:
        t.join()
    lasts = []
    for i in range(len(ACTS)):
        SOURCES[i].update({key.attrs["id"]: ""
                           for key in bs4.BeautifulSoup(ACTS[i]._driver.page_source, "html.parser").find_all(
                               "div", attrs={"class": "BV_oImage"})})
        lasts.append(list(SOURCES[i].keys())[-1])
    couters = 0
    treads = len(ACTS)-1
    for i in range(len(ACTS)):
        T = threading.Thread(None, act_now, args=(i, PATHS[i]))
        T.start()
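
The main block above lets a link line carry an optional output folder after a run of four backticks. Below is a minimal sketch of that parsing step, using `str.partition` instead of the `find`/`rfind` slicing. `parse_entry` and `SEPARATOR` are hypothetical names, not part of the script.

```python
SEPARATOR = "`" * 4  # four backticks, the marker the script looks for


def parse_entry(line):
    """Split one line of the link file into (url, output_folder).

    Returns None for blank lines and lines starting with '#'.
    output_folder is None when the line names no folder.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    url, _, folder = line.partition(SEPARATOR)
    return url, (folder or None)
```

Returning `None` for skipped lines lets the caller build the url list and the folder list in one pass, so the two always have the same length.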

BIN
geckodriver Executable file

Binary file not shown.

13
geckodriver.log Normal file

@@ -0,0 +1,13 @@
1682411282110 geckodriver INFO Listening on 127.0.0.1:60235
1682411282158 mozrunner::runner INFO Running command: MOZ_CRASHREPORTER="1" MOZ_CRASHREPORTER_NO_REPORT="1" MOZ_CRASHREPORTER_SHUTDOWN="1" MOZ_NO_REMOTE="1" "/usr ... EN" "--remote-debugging-port" "52929" "--remote-allow-hosts" "localhost" "-no-remote" "-profile" "/tmp/rust_mozprofileyJU3uD"
1682411283746 Marionette INFO Marionette enabled
1682411283750 Marionette INFO Listening on port 41397
WebDriver BiDi listening on ws://localhost:52929
Read port: 41397
1682411283910 RemoteAgent WARN TLS certificate errors will be ignored for this session
console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /tmp/rust_mozprofileyJU3uD/search.json.mozlz4", (void 0)))
Missing chrome or resource URL: resource://gre/modules/UpdateListener.jsm
Missing chrome or resource URL: resource://gre/modules/UpdateListener.sys.mjs
DevTools listening on ws://localhost:52929/devtools/browser/6a936edc-696b-4097-8331-7af0838c543e
JavaScript warning: https://cdn.cet.ac.il/libs/cet.editorManager/1.0/cetEditorManager.js, line 733: unreachable code after return statement
JavaScript warning: https://kotar.cet.ac.il/ClientResourcesServingHandler.ashx?h=420b0f7c6bfad886e847426ac12334514f5f4fc6&t=javascript&minify=True, line 100: unreachable code after return statement

25
requirements.txt Normal file

@@ -0,0 +1,25 @@
async-generator==1.10
attrs==23.1.0
beautifulsoup4==4.12.2
bs4==0.0.1
certifi==2022.12.7
chromedriver-autoinstaller==0.4.0
deprecation==2.1.0
exceptiongroup==1.1.1
h11==0.14.0
idna==3.4
img2pdf==0.4.4
lxml==4.9.2
outcome==1.2.0
packaging==23.1
pikepdf==7.2.0
Pillow==9.5.0
PySocks==1.7.1
selenium==4.9.0
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.4.1
trio==0.22.0
trio-websocket==0.10.2
urllib3==1.26.15
wsproto==1.2.0