Collecting Images from Telegram Web Without API Using Scrapy
Introduction
In this project, we extract images from the public Telegram channel of the Ukrainian Air Force (@kpszsu) without using the Telegram API. This approach is useful when registering for API credentials or working within the API's limits is impractical: the public web preview at t.me/s/kpszsu already exposes the channel's posts. The extracted images will be used to analyze the destruction of Russian drones.
Context: Drones in the Russia-Ukraine War
Russia has frequently used Shahed-136/131 kamikaze drones in its aggression against Ukraine. These drones, supplied by Iran, are low-cost, long-range attack UAVs designed to target Ukrainian infrastructure and military assets. The Ukrainian Air Force regularly publishes reports on intercepted and destroyed drones, often represented visually through infographics. By collecting these images, we can automate data extraction and analyze trends in drone interceptions.
Step 1: Setting Up Scrapy and Headers
For this project, I chose Kaggle as the execution platform and wrote the entire script there. Kaggle provides a convenient cloud environment with pre-installed libraries, making it easy to run scripts.
I started by installing Scrapy, importing the required dependencies, and setting up request headers.
Installing Scrapy
Since Scrapy is not pre-installed in Kaggle notebooks, we need to install it first:
!pip install -q scrapy
Importing Dependencies
We import essential Python libraries for handling HTTP requests, regex, file operations, and time management. Additionally, we import Scrapy and its crawler module:
import re
import os
import json
import time
import requests
import scrapy
from scrapy.crawler import CrawlerProcess
Setting Up Headers
To avoid being blocked and to make our requests appear as if they come from a real browser, we define HTTP headers. These headers simulate a request from Mozilla Firefox 135.0 and provide necessary details such as the referer, encoding, and accepted content types.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'http://t.me',
    'DNT': '1',
    'Sec-GPC': '1',
    'Connection': 'keep-alive',
    'Referer': 'http://t.me/s/kpszsu',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Content-Length': '0',
    'TE': 'trailers',
}
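Before wiring these headers into Scrapy, it can help to confirm that the public archive endpoint accepts them. The snippet below is a minimal sanity check using the requests module and the HEADERS defined above; the ?before value is simply the starting offset used later by the spider.
# Minimal sanity check: fetch one archive page with the headers above.
resp = requests.get("https://t.me/s/kpszsu?before=29652", headers=HEADERS, timeout=30)
print(resp.status_code)                        # 200 means the archive endpoint accepted the request
print(resp.headers.get("Content-Type"))        # the XHR endpoint returns JSON (decoded later via response.json())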
Step 2: Extracting Image URLs
Telegram Web embeds post photos as CSS background-image values inside inline style attributes, so we need a small helper that pulls the image URL out of the url('…') expression:
def get_url(raw_text):
    # Pull the URL out of a CSS url('...') expression; returns None if there is no match.
    match = re.search(r"url\('([^']+)'\)", raw_text)
    if match:
        return match.group(1)
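For illustration, here is how the helper behaves on a style value of the kind the spider extracts (the URL is a made-up placeholder):
style = "background-image:url('https://cdn4.cdn-telegram.org/file/example.jpg')"  # placeholder value
print(get_url(style))  # -> https://cdn4.cdn-telegram.org/file/example.jpg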
Step 3: Creating a Scrapy Spider
We define a Scrapy spider (CustomSpider) to:
- Iterate over Telegram archive pages.
- Extract image URLs from posts containing the ✊ (fist) symbol, which may indicate drone destruction reports.
- Save extracted image URLs in JSON format.
class CustomSpider(scrapy.Spider):
    name = "parse_telegram_web"

    def start_requests(self):
        # Page through the channel archive by stepping the ?before parameter down from the newest post ID.
        BEFORE = 29652
        start_urls = [f"http://t.me/s/kpszsu?before={ind}" for ind in range(BEFORE, 0, -30)]
        for url in start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                dont_filter=True,
                headers=HEADERS,
                callback=self.parse,
            )

    def parse(self, response):
        # The XHR endpoint returns the page HTML as a JSON-encoded string,
        # so decode it and wrap it in a Selector.
        selector = scrapy.Selector(text=response.json())
        for el in selector.css(".media_supported_cont"):
            # Only keep media when the ✊ symbol used in destruction reports is present.
            if el.xpath("//*[contains(text(), '✊')]"):
                data = el.css("a::attr(style)").get()
                if data:
                    # Name the file after the last path segment of the post link; fall back to a timestamp.
                    file_name = el.css("a::attr(href)").get().rsplit("/", 1)[-1] or int(time.time() * 1000)
                    yield {
                        'file_name': file_name,
                        'image_url': get_url(data),
                    }
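Each matching post yields a dictionary with two fields; the JSON feed written in the next step is simply an array of such items (the values below are placeholders, not real data):
# Shape of a single yielded item (placeholder values):
item = {
    "file_name": "12345",                                            # last segment of the post link
    "image_url": "https://cdn4.cdn-telegram.org/file/example.jpg",   # taken from the CSS style
}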
Step 4: Running the Scrapy Spider
We run the spider and save the collected data to collect_image_url.json:
# Run the spider and write the results to a JSON feed.
process = CrawlerProcess(
    settings={
        "FEEDS": {
            "collect_image_url.json": {"format": "json"},
        }
    }
)
process.crawl(CustomSpider)
process.start()
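Once the crawl finishes, a quick check confirms the feed was written. Note that CrawlerProcess starts a Twisted reactor, which cannot be restarted within the same notebook session, so rerunning this cell requires restarting the kernel.
# Count how many image URLs the spider collected.
with open("collect_image_url.json") as f:
    items = json.load(f)
print(len(items), "image URLs collected")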
Step 5: Downloading Images
We download images from the extracted URLs and save them to a files/ directory:
# Load the collected URLs.
with open("/kaggle/working/collect_image_url.json") as f:
    json_data = json.loads(f.read())

# Create the output directory (prints an error if it already exists).
try:
    os.mkdir("files")
except OSError as error:
    print(error)

# Download each image and save it under its file name.
for el in json_data:
    try:
        response = requests.get(el["image_url"])
        if response.status_code == 200:
            with open(f"files/{el['file_name']}.jpg", "wb") as file:
                file.write(response.content)
    except Exception as err:
        print(err)
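Because the archive spans thousands of posts, it may be worth being gentler with Telegram's CDN. The variant below is only a sketch of such a refinement, adding a request timeout and a short pause between downloads:
for el in json_data:
    try:
        response = requests.get(el["image_url"], timeout=30)
        if response.status_code == 200:
            with open(f"files/{el['file_name']}.jpg", "wb") as file:
                file.write(response.content)
    except Exception as err:
        print(err)
    time.sleep(0.5)  # brief pause so we do not hammer the CDN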
Example of an Extracted Image
Below is an example of an image that the script successfully extracts: a @kpszsu infographic reporting destroyed drones.
Conclusion
This script allows us to:
- ✅ Collect drone destruction reports from Telegram without an API.
- ✅ Extract image URLs from Telegram’s dynamic CSS structure.
- ✅ Download and save images for further analysis.
Next, we will use OpenCV and Tesseract (OCR) to extract dates and drone counts from these images.
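As a small preview of that step, a minimal OCR check on one downloaded file could look like the sketch below (it assumes opencv-python and pytesseract are available; the file name is a placeholder):
import cv2
import pytesseract

img = cv2.imread("files/12345.jpg")        # placeholder file name
text = pytesseract.image_to_string(img)    # raw OCR output; dates and counts still need parsing
print(text)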