
Collecting Images from Telegram Web Without API Using Scrapy

3 min read · Mar 1, 2025

Introduction

In this project, we aim to extract images from the Telegram channel of the Ukrainian Air Force (@kpszsu) without using the Telegram API. This approach is useful when the API is impractical, for example when you would rather not register an application and manage API credentials for a one-off scrape. The extracted images will be used to analyze the destruction of Russian drones.

Context: Drones in the Russia-Ukraine War

Russia has frequently used Shahed-136/131 kamikaze drones in its aggression against Ukraine. These drones, supplied by Iran, are low-cost, long-range attack UAVs designed to target Ukrainian infrastructure and military assets. The Ukrainian Air Force regularly publishes reports on intercepted and destroyed drones, often represented visually through infographics. By collecting these images, we can automate data extraction and analyze trends in drone interceptions.

Step 1: Setting Up Scrapy and Headers

For this project, I chose Kaggle as the execution platform and wrote the entire script there. Kaggle provides a convenient cloud environment with pre-installed libraries, making it easy to run scripts.

I started by installing Scrapy, importing the required dependencies, and setting up request headers.

Installing Scrapy

Since Scrapy is not pre-installed in Kaggle notebooks, we need to install it first:

!pip install -q scrapy

Importing Dependencies

We import essential Python libraries for handling HTTP requests, regex, file operations, and time management. Additionally, we import Scrapy and its crawler module:

import re
import os
import json
import time
import requests

import scrapy
from scrapy.crawler import CrawlerProcess

Setting Up Headers

To avoid being blocked and to make our requests appear as if they come from a real browser, we define HTTP headers. These headers simulate a request from Mozilla Firefox 135.0 and provide necessary details such as the referer, encoding, and accepted content types.

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:135.0) Gecko/20100101 Firefox/135.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'https://t.me',
    'DNT': '1',
    'Sec-GPC': '1',
    'Connection': 'keep-alive',
    'Referer': 'https://t.me/s/kpszsu',
    'Sec-Fetch-Dest': 'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-origin',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
    'Content-Length': '0',
    'TE': 'trailers',
}
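
Before wiring the spider together, it can be worth a one-off check that the archive endpoint behaves as the spider expects. This is an optional sketch, not part of the final script; it reuses the requests library imported above and assumes Telegram still answers XHR-style requests to ?before= with the posts' HTML wrapped in a JSON string, which is what the spider's parse method later relies on.

# Optional sanity check: fetch one archive page outside Scrapy.
probe = requests.get("https://t.me/s/kpszsu?before=29652", headers=HEADERS)
html_fragment = probe.json()  # JSON-decoded body: a single string of post HTML
print(probe.status_code, type(html_fragment), len(html_fragment))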

Step 2: Extracting Image URLs

Since Telegram Web loads post images as background-image values in inline CSS, we need a helper that extracts the image URL from a style attribute:

def get_url(raw_text):
    # Pull the image URL out of an inline background-image style, e.g. url('https://...')
    match = re.search(r"url\('([^']+)'\)", raw_text)
    if match:
        return match.group(1)
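
For example, applied to a style attribute shaped like the ones Telegram emits (the URL here is a placeholder, not a real CDN link):

# Placeholder style string in the shape Telegram uses for photo wrappers
sample_style = "width:100%;background-image:url('https://example.com/file/photo123.jpg')"
print(get_url(sample_style))  # -> https://example.com/file/photo123.jpg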

Step 3: Creating a Scrapy Spider

We define a Scrapy spider (CustomSpider) to:

  • Iterate over Telegram archive pages.
  • Extract image URLs from posts containing the ✊ (fist) symbol, which may indicate drone destruction reports.
  • Save extracted image URLs in JSON format.

class CustomSpider(scrapy.Spider):
    name = "parse_telegram_web"

    def start_requests(self):
        # Highest message ID to paginate back from; archive pages are requested in descending order.
        BEFORE = 29652
        start_urls = [f"https://t.me/s/kpszsu?before={ind}" for ind in range(BEFORE, 0, -30)]

        for url in start_urls:
            yield scrapy.Request(
                url=url,
                method='GET',
                dont_filter=True,
                headers=HEADERS,
                callback=self.parse,
            )

    def parse(self, response):
        # The XHR endpoint returns the posts' HTML as a JSON-encoded string,
        # so we decode it and feed it to a Selector.
        selector = scrapy.Selector(text=response.json())
        for el in selector.css(".media_supported_cont"):
            # The ✊ symbol marks posts that may be drone destruction reports.
            if el.xpath("//*[contains(text(), '✊')]"):
                data = el.css("a::attr(style)").get()
                if data:
                    # Use the post ID from the link as the file name, falling back to a timestamp.
                    file_name = el.css("a::attr(href)").get().rsplit("/", 1)[-1] or int(time.time() * 1000)
                    yield {
                        'file_name': file_name,
                        'image_url': get_url(data),
                    }
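
One thing to note: BEFORE = 29652 is a hardcoded message ID, so the spider will not see posts published after it. If you rerun the scrape later, a small helper along these lines could look up the newest ID from the channel preview page first. This is a sketch, and it assumes t.me still tags each post with a data-post="kpszsu/<id>" attribute:

# Hypothetical helper: read the newest message ID instead of hardcoding BEFORE.
def latest_message_id(channel="kpszsu"):
    # Plain (non-XHR) request so t.me returns the full HTML preview page.
    html = requests.get(f"https://t.me/s/{channel}",
                        headers={"User-Agent": HEADERS["User-Agent"]}).text
    ids = re.findall(rf'data-post="{channel}/(\d+)"', html)
    return max(int(i) for i in ids) if ids else None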

Step 4: Running the Scrapy Spider

We run the spider and save the collected data to collect_image_url.json:

# Run the spider
process = CrawlerProcess(
    settings={
        "FEEDS": {
            "collect_image_url.json": {"format": "json"},
        }
    }
)
process.crawl(CustomSpider)
process.start()

Step 5: Downloading Images

We download images from the extracted URLs and save them to a files/ directory:

with open("/kaggle/working/collect_image_url.json") as f:
    json_data = json.loads(f.read())

# Create the output directory (print the error if it already exists).
try:
    os.mkdir("files")
except OSError as error:
    print(error)

# Fetch every collected URL and write the image to disk.
for el in json_data:
    try:
        response = requests.get(el["image_url"])
        if response.status_code == 200:
            with open(f"files/{el['file_name']}.jpg", "wb") as file:
                file.write(response.content)
    except Exception as err:
        print(err)
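
A quick count confirms how many of the collected URLs actually produced a saved file:

# Simple check of the download results
saved = [name for name in os.listdir("files") if name.endswith(".jpg")]
print(f"Saved {len(saved)} of {len(json_data)} images")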

Example of an Extracted Image

Below is an example of an image that the script successfully extracts:

Conclusion

This script allows us to:

  • ✅ Collect drone destruction reports from Telegram without an API.
  • ✅ Extract image URLs from Telegram’s dynamic CSS structure.
  • ✅ Download and save images for further analysis.

Next, we will use OpenCV and Tesseract (OCR) to extract dates and drone counts from these images.
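
As a small preview of that step, the sketch below runs Tesseract over one of the saved infographics. pytesseract, Pillow, and the Ukrainian language pack ('ukr') are assumptions here; they are not installed or used by the script above.

# Preview of the planned OCR step (requires: pip install pytesseract pillow,
# plus the tesseract-ocr and tesseract-ocr-ukr system packages).
import pytesseract
from PIL import Image

sample = sorted(os.listdir("files"))[0]  # any downloaded infographic
raw_text = pytesseract.image_to_string(Image.open(f"files/{sample}"), lang="ukr")
print(raw_text[:300])  # dates and drone counts will be parsed from this later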
