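Python Web Scraping | Scraping Tesla Charger Locations in the US (with Latitude and Longitude)
warning: This article was last modified 814 days ago, so some of its content may be out of date.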
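Overview
I have been working on a practice run for the MCM, and Problem D of the 2018 contest needs location data for the two kinds of Tesla chargers in the US: Supercharger and Destination Charging. This post shows how to use Python to scrape Tesla's official site for the state, county, street address, latitude/longitude coordinates, and FIPS information of chargers that are already built, plus the approximate location and planned opening date of chargers that are coming soon.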
Details
Data sources
- Supercharger
https://www.tesla.com/findus/list/superchargers/United+States
- Destination Charging
https://www.tesla.com/findus/list/chargers/United+States
Results
tesla_charger_us.xlsx
tesla_supercharger_us.xlsx
Download: https://cowtransfer.com/s/7c7a8523a76945
Implementation
Environment
Python 3.9.7
Jupyter 1.0.0
Packages
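Name | Version | Notes |
---|---|---|
lxml | 4.6.3 | \ |
numpy | 1.20.3 | \ |
pandas | 1.3.4 | \ |
requests | 2.26.0 | \ |
regex | 2021.8.3 | Third-party regular-expression package; the script itself only uses the built-in `re` module |
progressbar2 | 4.0.0 | Shows a progress bar in the console |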
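Code
The listing pages for the two charger types are almost identical, so the same code can scrape either one. Remember to change the URL passed to the first requests.get() call (the one that fetches the index page) to the matching page from the data sources above, and change the CSV filename at the end of the script so that a second run does not overwrite the first.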
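One way to avoid editing the script in two places is to keep each source URL next to its output filename. The following is only a sketch: `SOURCES` and `job_for` are hypothetical names that do not appear in the original script, and the Supercharger CSV filename is an assumption modelled on the published `tesla_supercharger_us.xlsx`.

```python
from typing import Tuple

# Hypothetical lookup table: charger type -> (index page URL, output CSV path).
# The URLs come from the data sources above; the Supercharger CSV name is an assumption.
SOURCES = {
    'supercharger': (
        'https://www.tesla.com/findus/list/superchargers/United+States',
        'data/tesla_supercharger_us.csv',
    ),
    'destination': (
        'https://www.tesla.com/findus/list/chargers/United+States',
        'data/tesla_charger_us.csv',
    ),
}


def job_for(kind: str) -> Tuple[str, str]:
    """Return (index_url, csv_path) for the requested charger type."""
    try:
        return SOURCES[kind]
    except KeyError:
        raise ValueError('unknown charger type: {!r}'.format(kind)) from None


index_url, csv_path = job_for('destination')
print(index_url)
print(csv_path)
```

With this in place, the index URL and the CSV path used by the script would both come from a single `job_for(...)` call, so switching charger types means changing one string.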
#!/usr/bin/env python
# coding: utf-8
# # 2018-MCM-D
# ## Data scraping
# ### Scrape basic information for Tesla chargers
import numpy as np
import pandas as pd
import requests
from lxml import etree

# Content: Tesla charger locations in the US
# Sources: https://www.tesla.com/findus/list/superchargers/United+States (Supercharger)
#          https://www.tesla.com/findus/list/chargers/United+States (Destination Charging)
data_col = ['name',
            'common_name',
            'street_address',
            'extended_address',
            'locality',
            'google_navi',
            'charging',
            'coming',
            'opening_date',
            'detail_url']
full_data = pd.DataFrame(columns=data_col)  # holds the scraped records

# Index page (root); swap in the other source URL to scrape the other charger type
html = requests.get("https://www.tesla.com/findus/list/chargers/United+States")
etree_html = etree.HTML(html.text)

# Count how many states have charging sites (one <h2> heading per state)
states_content = etree_html.xpath('//*[@id="find-us-list-container"]/div/section/div/h2')
state_count = 0
for each_state in states_content:
    print(each_state.text)
    state_count = state_count + 1
print('-------------\n{} states in total'.format(state_count))
# 57 headings in total, in which ME and Maine are duplicates
# 56 in fact

# Collect the detail-page URL of every charging site
detail_url = etree_html.xpath('//*[@id="find-us-list-container"]/div/section/div/div/address/a')
root_data = pd.DataFrame(columns=['name', 'detail_url'])
index = 0
for each_url in detail_url:
    print(each_url.text)
    print(each_url.attrib)
    # hrefs on the index page are site-relative, so prepend the host
    full_url = "https://www.tesla.com" + each_url.attrib.get('href')
    root_data.loc[index] = [each_url.text, full_url]
    index = index + 1
print(root_data)

# ### Parse the detail page of each Tesla charger
import os
import re
import progressbar

# Everything of interest sits inside the <address class="vcard"> block
address_pattern = re.compile(r'<address.*?class="vcard">(.*?)</address>', re.S)
# One pattern per output column; the first match inside the vcard block is kept
field_patterns = {
    'common_name': re.compile(r'<span.*?class="common-name">(.*?)</span>', re.S),
    'street_address': re.compile(r'<span.*?class="street-address">(.*?)</span>', re.S),
    # accept both spellings of the class name (hyphen or underscore)
    'extended_address': re.compile(r'<span.*?class="extended[-_]address">(.*?)</span>', re.S),
    'locality': re.compile(r'<span.*?class="locality">(.*?)</span>', re.S),
    'google_navi': re.compile(r'<a.*?href="(.*?)" target="_blank">', re.S),
    'charging': re.compile(r'harging</strong>.*?>(.*?)</p>', re.S),
    'coming': re.compile(r'<span.*?class="coming-soon">(.*?)</span>', re.S),
    'opening_date': re.compile(r'<p><strong>Target opening in (.*?)</strong></p>', re.S),
}

p = progressbar.ProgressBar()
p.start(len(root_data))
# get further information through 'detail_url'
for index, row in root_data.iterrows():
    # start from an empty record so that missing fields stay blank
    collected_row = dict.fromkeys(data_col, '')
    collected_row['name'] = row['name']
    collected_row['detail_url'] = row['detail_url']

    response = requests.get(row['detail_url'])
    root = address_pattern.findall(response.text)
    if root:  # skip field extraction when the page has no vcard block
        for field, pattern in field_patterns.items():
            matches = pattern.findall(root[0])
            if matches:
                collected_row[field] = matches[0]

    full_data.loc[index] = collected_row
    p.update(index + 1)
p.finish()

# make sure the output directory exists before writing
os.makedirs('data', exist_ok=True)
full_data.to_csv('data/tesla_charger_us.csv')
print(full_data)
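The script stores the driving-directions link in `google_navi` but does not split out the coordinates. The sketch below shows one way that could be done, assuming the link embeds the position as a `lat,lng` pair of signed decimals (for example `...maps?daddr=37.4925,-121.9447`). That URL shape is an assumption and has not been checked against the live site, and `parse_lat_lng` is a hypothetical helper, not part of the original code.

```python
import re
from typing import Optional, Tuple

# Assumption: coordinates appear as "<lat>,<lng>" (comma possibly URL-encoded as %2C).
COORD_PATTERN = re.compile(
    r'(-?\d{1,3}\.\d+)(?:,|%2C)\s*(-?\d{1,3}\.\d+)', re.I
)


def parse_lat_lng(google_navi: str) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude) found in a directions URL, or None."""
    match = COORD_PATTERN.search(google_navi or '')
    if match is None:
        return None
    lat, lng = float(match.group(1)), float(match.group(2))
    # Discard pairs that cannot be valid geographic coordinates
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return None
    return lat, lng


# Made-up example URL in the assumed format
print(parse_lat_lng('https://maps.google.com/maps?daddr=37.4925,-121.9447'))
```

Applied to the scraped table, `full_data['google_navi'].map(parse_lat_lng)` would give one `(lat, lng)` tuple or `None` per row, and the `None` rows show where the assumed format does not hold.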