To improve pocsuite3's concurrency performance, I considered introducing coroutines, so I ran the following tests.
Test 1
Local test
Test code
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time   : 2019/6/21 10:40 AM
# @Author : w8ay
# @File   : xiechenga.py
# coroutines vs. threads
import asyncio
import queue
import threading
import time

import aiohttp
import requests


class threadpool:
    def __init__(self, threadnum):
        self.thread_count = self.thread_nums = threadnum
        self.queue = queue.Queue()
        self.isContinue = True
        self.thread_count_lock = threading.Lock()

    def push(self, payload):
        self.queue.put(payload)

    def changeThreadCount(self, num):
        self.thread_count_lock.acquire()
        self.thread_count += num
        self.thread_count_lock.release()

    def stop(self):
        self.isContinue = False

    def run(self):
        th = []
        for i in range(self.thread_nums):
            t = threading.Thread(target=self.scan)
            t.setDaemon(True)
            t.start()
            th.append(t)
        # It can quit with Ctrl-C
        try:
            while 1:
                if self.thread_count > 0 and self.isContinue:
                    time.sleep(0.01)
                else:
                    break
        except KeyboardInterrupt:
            exit("User Quit")

    def scan(self):
        # each worker pulls URLs from the queue until it is empty
        while 1:
            if self.queue.qsize() > 0 and self.isContinue:
                p = self.queue.get()
            else:
                break
            try:
                resp = requests.get(p)
                a = resp.text
            except Exception:
                pass
        self.changeThreadCount(-1)


async def request(url):
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                content = await resp.text()
                a = content
    except Exception:
        pass


def aiorequests(url_list):
    loop = asyncio.get_event_loop()
    tasks = []
    for url in url_list:
        task = loop.create_task(request(url))
        tasks.append(task)
    loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    url_list = ["http://www.baidu.com", "http://www.hacking8.com", "http://www.seebug.org",
                "https://0x43434343.com/", "https://lorexxar.cn/", "https://0x48.pw/",
                "https://github.com"]
    print("Test URLs: {}".format(repr(url_list)))
    url_list = url_list * 100
    print("Total requests: {}".format(len(url_list)))

    start_time = time.time()
    aiorequests(url_list)
    print("coroutines cost time", time.time() - start_time)

    start_time = time.time()
    http_pool = threadpool(50)
    for i in url_list:
        http_pool.push(i)
    http_pool.run()
    print("multithreading cost time", time.time() - start_time)
```
System and environment
macOS Mojave 10.14.5
Python 3.7.2
requests 2.21.0, aiohttp 3.5.4
Other: 50 threads
Bandwidth
Total requests: 210
The two finished in about the same time; with only 210 URLs and 50 threads, the run may have been dominated by thread startup overhead.
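As a quick sanity check of the thread-startup theory, one can time nothing but the spawn and join of 50 no-op threads:

```python
import threading
import time

# Time only the creation, start, and join of 50 do-nothing threads,
# isolating thread startup overhead from any network cost.
start = time.time()
threads = [threading.Thread(target=lambda: None) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("50 thread start/join:", time.time() - start)
```

If this comes back in milliseconds, startup cost alone probably isn't the whole explanation.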
Total requests: 700
```
Test URLs: ['http://www.baidu.com', 'http://www.hacking8.com', 'http://www.seebug.org', 'https://0x43434343.com/', 'https://lorexxar.cn/', 'https://0x48.pw/', 'https://github.com']
Total requests: 700
coroutines cost time 94.01872515678406
multithreading cost time 44.3960919380188
```
Total requests: 3500
```
Test URLs: ['http://www.baidu.com', 'http://www.hacking8.com', 'http://www.seebug.org', 'https://0x43434343.com/', 'https://lorexxar.cn/', 'https://0x48.pw/', 'https://github.com']
Total requests: 3500
Fatal read error on socket transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x105ad5ac8>
transport: <_SelectorSocketTransport fd=604 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/selector_events.py", line 801, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 60] Operation timed out
[the same TimeoutError traceback repeated for six more transports]
coroutines cost time 301.6471061706543
multithreading cost time 96.64304375648499
```
Here, multithreading clearly outperformed the coroutines.
Modified code, total requests 700
The coroutine runs had produced errors, mainly around SSL reads and writes, so I tried capping coroutine concurrency by adding an asyncio semaphore.
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time   : 2019/6/21 10:40 AM
# @Author : w8ay
# @File   : xiechenga.py
# coroutines, now with a semaphore capping concurrency
import asyncio
import queue
import threading
import time

import aiohttp
import requests


class threadpool:
    def __init__(self, threadnum):
        self.thread_count = self.thread_nums = threadnum
        self.queue = queue.Queue()
        self.isContinue = True
        self.thread_count_lock = threading.Lock()

    def push(self, payload):
        self.queue.put(payload)

    def changeThreadCount(self, num):
        self.thread_count_lock.acquire()
        self.thread_count += num
        self.thread_count_lock.release()

    def stop(self):
        self.isContinue = False

    def run(self):
        th = []
        for i in range(self.thread_nums):
            t = threading.Thread(target=self.scan)
            t.setDaemon(True)
            t.start()
            th.append(t)
        # It can quit with Ctrl-C
        try:
            while 1:
                if self.thread_count > 0 and self.isContinue:
                    time.sleep(0.01)
                else:
                    break
        except KeyboardInterrupt:
            exit("User Quit")

    def scan(self):
        while 1:
            if self.queue.qsize() > 0 and self.isContinue:
                p = self.queue.get()
            else:
                break
            try:
                resp = requests.get(p)
                a = resp.text
            except Exception:
                pass
        self.changeThreadCount(-1)


async def request(url, semaphore):
    # the semaphore limits how many requests are in flight at once
    async with semaphore:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    content = await resp.text()
                    a = content
        except Exception:
            pass


def aiorequests(url_list):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(200)  # cap coroutine concurrency at 200
    tasks = []
    for url in url_list:
        task = loop.create_task(request(url, semaphore))
        tasks.append(task)
    loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    url_list = ["http://www.baidu.com", "http://www.hacking8.com", "http://www.seebug.org",
                "https://0x43434343.com/", "https://lorexxar.cn/", "https://0x48.pw/",
                "https://github.com"]
    print("Test URLs: {}".format(repr(url_list)))
    url_list = url_list * 100
    print("Total requests: {}".format(len(url_list)))

    start_time = time.time()
    aiorequests(url_list)
    print("coroutines cost time", time.time() - start_time)

    start_time = time.time()
    http_pool = threadpool(50)
    for i in url_list:
        http_pool.push(i)
    http_pool.run()
    print("multithreading cost time", time.time() - start_time)
```
The output:
```
Test URLs: ['http://www.baidu.com', 'http://www.hacking8.com', 'http://www.seebug.org', 'https://0x43434343.com/', 'https://lorexxar.cn/', 'https://0x48.pw/', 'https://github.com']
Total requests: 700
Fatal read error on socket transport
protocol: <asyncio.sslproto.SSLProtocol object at 0x1093470f0>
transport: <_SelectorSocketTransport fd=129 read=polling write=<idle, bufsize=0>>
Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/selector_events.py", line 801, in _read_ready__data_received
    data = self._sock.recv(self.max_size)
TimeoutError: [Errno 60] Operation timed out
[the same TimeoutError traceback repeated for one more transport]
coroutines cost time 125.1102237701416
multithreading cost time 37.17827892303467
```
Total requests: 7000
```
Test URLs: ['http://www.baidu.com', 'http://www.hacking8.com', 'http://www.seebug.org', 'https://0x43434343.com/', 'https://lorexxar.cn/', 'https://0x48.pw/', 'https://github.com']
Total requests: 7000
coroutines cost time 820.6436007022858
multithreading cost time 197.8958179950714
```
Remote test
Provider: Vultr
Since this is an overseas server, the test URLs were changed to baidu, github, and google, and the thread count was lowered to 20.
Test code
The test code is the same as the modified code above.
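Concretely, the remote run amounts to tweaking a few constants in that script; the following is a reconstruction (the post describes the changes only in prose, so the exact values are assumptions):

```python
# Reconstructed settings for the remote run (assumed, not an actual diff):
url_list = ["http://www.baidu.com", "http://github.com", "http://google.com"]
url_list = url_list * 10   # 3 URLs x 10 = 30 total requests
thread_num = 20            # thread count lowered from 50 to 20
```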
Total requests: 30
```
root@vultr:~# python3 test.py
Test URLs: ['http://www.baidu.com', 'http://github.com', 'http://google.com']
Total requests: 30
coroutines cost time 6.938291788101196
multithreading cost time 0.7714250087738037
root@vultr:~#
```
Oddly, after removing baidu:
```
root@vultr:~# python3 test.py
Test URLs: ['http://github.com', 'http://google.com']
Total requests: 20
coroutines cost time 0.5997216701507568
multithreading cost time 0.6937429904937744
```
The two are nearly identical.
Total requests: 600
With baidu excluded from the test URLs:
```
root@vultr:~# python3 test.py
Test URLs: ['http://github.com', 'http://google.com']
Total requests: 600
coroutines cost time 3.049583673477173
multithreading cost time 12.464868545532227
```
Now the coroutines turned out to be noticeably more efficient.
Total requests: 18000
With baidu still excluded, I raised the total to 18000.
```
root@vultr:~# cat result1
Test URLs: ['http://github.com', 'http://google.com']
Total requests: 18000
coroutines cost time 91.383061170578
multithreading cost time 322.98390197753906
root@vultr:~#
```
Again, coroutines look clearly more efficient.
After adding baidu.com back in, the program ran for nearly 50 minutes without finishing...
When it finally completed, it turned out that multithreading was a full 10x slower.
Go test
For a more complete comparison, I also wrote a demo in Go, capping goroutine concurrency at 200. The code:
```go
package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"net/http"
	"sync"
	"time"
)

func Get(url string, ch chan int) (content []byte, err error) {
	// release the concurrency slot when this request finishes,
	// including on the error paths
	defer func() { <-ch }()
	transCfg := &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // disable verify
	}
	Client := &http.Client{
		Timeout:   100 * time.Second,
		Transport: transCfg,
	}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	resp, err2 := Client.Do(req)
	if err2 != nil {
		return nil, err2
	}
	defer resp.Body.Close()
	bytes, _ := ioutil.ReadAll(resp.Body)
	return bytes, nil
}

func main() {
	var s1 = [...]string{"http://www.baidu.com", "http://www.hacking8.com", "http://www.seebug.org",
		"https://0x43434343.com/", "https://0x48.pw/", "https://lorexxar.cn/", "https://github.com"}
	number := 100 // multiplier
	fmt.Printf("Total requests: %d\n", number*len(s1))
	wg := sync.WaitGroup{}
	t1 := time.Now()
	ch := make(chan int, 200) // buffered channel caps goroutine concurrency at 200
	for i := 0; i < number; i++ {
		for _, v := range s1 {
			wg.Add(1)
			ch <- 1
			go func(url string, c chan int) {
				defer wg.Done()
				_, err := Get(url, c)
				if err != nil {
					fmt.Println(err)
				}
			}(v, ch)
		}
	}
	wg.Wait()
	elapsed := time.Since(t1)
	fmt.Println("elapsed: ", elapsed)
}
```
Total requests: 700
Compared with the Python coroutines, it actually seemed somewhat slower...
Total requests: 7000
Still hadn't finished... in any case, very slow.
Test 2
That evening I discussed the results with some colleagues, and we noticed another odd phenomenon: some sites are fast under coroutines while others are very slow. We ended up drawing up a more systematic test plan:
- Split the source into a coroutine test script and a thread test script (a sketch of the coroutine half follows this list)
- Run everything on a VPS, since the office network fluctuates a lot
- Treat the coroutine semaphore and the thread count as controlled variables
- Rewrite the Go demo as a producer-consumer model
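A rough sketch of what the coroutine half of the split might look like (the post doesn't show the separated scripts, so the layout and defaults here are assumptions based on the Test 1 code):

```python
#!/usr/bin/env python3
# Sketch of the separated coroutine-only test script (assumed layout).
import asyncio
import time

import aiohttp


async def fetch(url, semaphore):
    # the semaphore caps concurrent requests, as in the modified script above
    async with semaphore:
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url) as resp:
                    await resp.text()
        except Exception:
            pass


def run(url_list, concurrency=200):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [loop.create_task(fetch(url, semaphore)) for url in url_list]
    loop.run_until_complete(asyncio.wait(tasks))


if __name__ == '__main__':
    urls = ["https://github.com"] * 1000  # single-URL case from the next section
    start = time.time()
    run(urls)
    print("coroutines cost time", time.time() - start)
```

The thread-only script would keep the threadpool class from Test 1 unchanged.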
Single-URL test
Local test
Using the same code as in Test 1, I tested a single URL, github.com. Coroutines went first, with coroutine concurrency set to 200.
1000 requests with coroutines took 21 s.
Then the same with multithreading, thread count set to 200.
1000 requests with multithreading took 16 s.
Next I raised the total to 5000 URLs, coroutine concurrency still 200.
5000 requests with coroutines took 133 s.
5000 requests with multithreading took 66 s.
Remote test
To rule out interference from office network jitter, I ran the same tests on a Tencent Cloud CentOS server (1 core, 1 GB RAM, 1 Mbps).
With multithreading, the same 1000-URL test took only 18 s.
With coroutines, however, 1000 URLs took a surprising 38 s. Suspecting the cap on coroutine concurrency was the problem, I raised the semaphore limit to 8000, and the run took even longer.
Go test
```go
package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"net/http"
	"sync"
	"time"
)

func Get2(url string) (content []byte, err error) {
	transCfg := &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // disable verify
	}
	Client := &http.Client{
		Timeout:   100 * time.Second,
		Transport: transCfg,
	}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	resp, err2 := Client.Do(req)
	if err2 != nil {
		return nil, err2
	}
	defer resp.Body.Close()
	bytes, _ := ioutil.ReadAll(resp.Body)
	return bytes, nil
}

func main() {
	var s1 = [...]string{"https://github.com"}
	number := 1000 // multiplier
	fmt.Printf("Total requests: %d\n", number*len(s1))
	wg := sync.WaitGroup{}
	t1 := time.Now()
	consumer := make(chan string, 20)
	// consumers: 1000 workers pull URLs from the channel
	for i := 0; i < 1000; i++ {
		go func() {
			for {
				url := <-consumer
				_, err := Get2(url)
				if err != nil {
					fmt.Println(err)
				}
				wg.Done()
			}
		}()
	}
	// producer
	for i := 0; i < number; i++ {
		for _, v := range s1 {
			wg.Add(1)
			consumer <- v
		}
	}
	wg.Wait()
	elapsed := time.Since(t1)
	fmt.Println("elapsed: ", elapsed)
}
```
I don't understand why the same site is handled so much less efficiently here. (One suspect, judging from the code: Get2 builds a fresh http.Client, and therefore a fresh TLS connection, for every request, so nothing is reused between requests.)
Multi-URL test
This test ran on the Vultr Debian server, with only two sites: url_list = ["http://github.com", "http://google.com"].
Coroutines and multithreading were run as separate scripts; the results:
```
Test URLs: ['http://github.com', 'http://google.com']
Total requests: 12000
coroutines cost time 63.13658905029297

Test URLs: ['http://github.com', 'http://google.com']
Total requests: 12000
multithreading cost time 143.3647768497467
```
Coroutines did somewhat better than multithreading here.
Now let's add a few more sites; the results:
```
python3 test.py
Test URLs: ['http://github.com', 'http://google.com', 'https://0x43434343.com/', 'https://lorexxar.cn/']
Total requests: 4000
coroutines cost time 20.34762692451477

python3 test.py
Test URLs: ['http://github.com', 'http://google.com', 'https://0x43434343.com/', 'https://lorexxar.cn/']
Total requests: 4000
multithreading cost time 58.68550777435303
```
Summary
At first I couldn't understand why the gap between coroutines and threads varies so much from site to site; the following is only a rough empirical summary, with no claim to authority.
If a site responds quickly, coroutines outperform threads; if a site responds slowly, threads outperform coroutines.
Under the hood, coroutine HTTP is built on IO multiplexing; see https://www.zhihu.com/question/20511233 for background.
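To make the multiplexing idea concrete, here is a toy sketch (not aiohttp's actual internals): a single thread starts several non-blocking connects and lets one selector report whichever sockets become ready first.

```python
import selectors
import socket

# Toy illustration of IO multiplexing: one thread, many sockets,
# a single select() call watching all of them at once.
sel = selectors.DefaultSelector()

for host in ("github.com", "google.com"):
    s = socket.socket()
    s.setblocking(False)
    s.connect_ex((host, 80))        # returns immediately (connect in progress)
    sel.register(s, selectors.EVENT_WRITE, data=host)

pending = 2
while pending:
    events = sel.select(timeout=5)  # block until some socket is ready
    if not events:                  # nothing became ready within 5 s
        break
    for key, _ in events:
        print(key.data, "connected")
        sel.unregister(key.fileobj)
        key.fileobj.close()
        pending -= 1
```

An event loop like asyncio's is essentially this pattern plus coroutine scheduling on top.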
If sites respond quickly, coroutines cost almost nothing, because everything runs in a single thread, whereas multithreading spends some resources on the threads themselves. But if a site responds slowly, a coroutine can also get stuck while connecting, and since everything shares that single thread, the stall drags on the whole run; multithreading doesn't have this problem.
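One detail worth flagging: the Python test scripts above never set a request timeout, so a slow host can hold a coroutine, and its semaphore slot, for aiohttp's long default timeout. A sketch of how a per-request cap could be added with aiohttp.ClientTimeout (the 10 s value is an assumption, not something the post tested):

```python
import asyncio

import aiohttp

# Cap each request at 10 s total so one slow host can't pin a slot for long.
TIMEOUT = aiohttp.ClientTimeout(total=10)


async def fetch(url, semaphore):
    async with semaphore:
        try:
            async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
                async with session.get(url) as resp:
                    await resp.text()
        except Exception:
            pass


async def main():
    semaphore = asyncio.Semaphore(200)
    await asyncio.gather(*(fetch("https://github.com", semaphore) for _ in range(10)))


asyncio.run(main())
```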
In the end, weighing all of the above, introducing coroutines into pocsuite3 still feels premature, so the idea has been shelved for now.