Robots.txt Analyzer

Robots.txt Analyzer is a powerful web tool designed to help webmasters and SEO professionals analyze, validate, and optimize their robots.txt files. Built with modern web technologies and deployed on Cloudflare's global network, this tool provides instant analysis and actionable recommendations to improve your site's crawler directives. You can try it out at robots-txt.arvid.tech.
What is Robots.txt Analyzer?
I developed this tool to address a common challenge in web development and SEO: properly configuring robots.txt files to control how search engines and other web crawlers interact with your site. While robots.txt files appear simple on the surface, they can have significant implications for your site's visibility, security, and performance. The analyzer provides a comprehensive scoring system, security recommendations, and detailed insights into your robots.txt configuration, all presented in a clean, intuitive interface that works seamlessly across devices.
The Technology Stack
Robots.txt Analyzer is built with a modern, performance-focused technology stack:
- Qwik: A web framework that delivers near-instant loading through resumability rather than hydration
- TypeScript: Adds strong typing to JavaScript, enhancing code quality and developer experience
- Cloudflare Pages: Hosts the application with global distribution for low-latency access worldwide
- Cloudflare D1: A serverless SQL database that stores analysis history and caches results
- Tailwind CSS: Provides utility-first styling for a responsive, clean interface
- Umami Analytics: A self-hosted analytics solution that tracks usage patterns while respecting visitor privacy
This architecture ensures the application is fast, reliable, and scalable, with minimal operational overhead.
Under the Hood: How the Parser Works
The heart of Robots.txt Analyzer is its sophisticated parsing and analysis engine. Let's explore how it works:
1. Parsing the Robots.txt File
The parser begins by breaking down the robots.txt file into its component parts:
export function parseRobotsTxt(content: string): RobotRule[] {
  const lines = content.split('\n').map(line => line.trim());
  const ruleMap = new Map<string, RobotRule>();
  let currentRule: RobotRule | null = null;

  for (const line of lines) {
    if (!line || line.startsWith('#')) continue;

    const colonIndex = line.indexOf(':');
    if (colonIndex === -1) continue;

    const directive = line.slice(0, colonIndex).trim().toLowerCase();
    const value = line.slice(colonIndex + 1).trim();

    // Process directives (User-agent, Disallow, Allow, etc.)
    // ...
  }

  // Return processed rules
  return Array.from(ruleMap.values());
}
The parser handles all standard robots.txt directives:
- User-agent: Specifies which crawler the rules apply to
- Disallow: Paths that should not be crawled
- Allow: Exceptions to disallow rules
- Crawl-delay: Suggested pause between crawler requests
- Sitemap: URLs to XML sitemaps
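To make the elided directive handling concrete, here is a minimal sketch of how each directive could be folded into the rule map. The RobotRule shape and the applyDirective helper are assumptions for illustration; the project's actual interface and loop body may differ:

// Assumed shape for illustration; the project's actual RobotRule interface may differ.
interface RobotRule {
  userAgent: string;
  allow: string[];
  disallow: string[];
  crawlDelay?: number;
  sitemaps: string[];
}

// Hypothetical handler for the loop body elided above: folds one directive into the rule map.
function applyDirective(
  ruleMap: Map<string, RobotRule>,
  current: RobotRule | null,
  directive: string,
  value: string
): RobotRule | null {
  switch (directive) {
    case 'user-agent': {
      const agent = value.toLowerCase();
      const rule = ruleMap.get(agent) ?? { userAgent: agent, allow: [], disallow: [], sitemaps: [] };
      ruleMap.set(agent, rule);
      return rule; // subsequent directives attach to this rule
    }
    case 'allow':
      current?.allow.push(value);
      return current;
    case 'disallow':
      current?.disallow.push(value);
      return current;
    case 'crawl-delay':
      if (current) current.crawlDelay = Number(value);
      return current;
    case 'sitemap':
      // Sitemap is not tied to a user-agent group in the spec; a real parser
      // would typically collect these globally rather than per rule.
      current?.sitemaps.push(value);
      return current;
    default:
      return current;
  }
}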
2. Web Application Detection
One of the analyzer's unique features is its ability to detect common web applications and frameworks based on patterns in the robots.txt file:
function detectWebApp(rules: RobotRule[]): {
  detected: WebAppSignature[],
  unprotectedPaths: { category: string; paths: string[] }[]
} {
  const allPaths = rules.flatMap(rule => [...rule.allow, ...rule.disallow]);
  const signatures = WEB_APP_SIGNATURES.map(sig => ({...sig}));

  // Check for web app signatures
  signatures.forEach(sig => {
    sig.patterns.forEach(pattern => {
      if (allPaths.some(path => path.includes(pattern))) {
        sig.confidence += 25;
      }
    });
  });

  // Check for unprotected sensitive paths
  // ...
}
The analyzer recognizes patterns for popular platforms like WordPress, Drupal, Joomla, Magento, Shopify, and more. This allows it to provide platform-specific recommendations.
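For a sense of what those signatures might look like, here is an illustrative shape and a few entries. The patterns field and confidence counter match the detection code above, while the specific entries and the name field are assumptions:

// Illustrative entries only; the project's real WEB_APP_SIGNATURES list may differ.
interface WebAppSignature {
  name: string;
  patterns: string[]; // path fragments that hint at the platform
  confidence: number; // starts at 0 and grows as patterns match
}

const WEB_APP_SIGNATURES: WebAppSignature[] = [
  { name: 'WordPress', patterns: ['/wp-admin', '/wp-content', '/wp-includes'], confidence: 0 },
  { name: 'Drupal', patterns: ['/core/', '/node/', '/user/register'], confidence: 0 },
  { name: 'Magento', patterns: ['/checkout/', '/customer/', '/catalogsearch/'], confidence: 0 },
];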
3. Security Analysis
The analyzer identifies potentially sensitive paths that should be protected from crawlers:
const SENSITIVE_PATHS = {
  admin: [
    '/wp-admin', '/administrator', '/admin', '/backend', '/manage',
    // ...
  ],
  auth: [
    '/login', '/signin', '/signup', '/register', '/auth',
    // ...
  ],
  // Additional categories...
};
By checking these paths against your robots.txt rules, the analyzer can identify security vulnerabilities where sensitive areas of your site might be exposed to search engines and potentially malicious crawlers.
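A rough sketch of that check, assuming a helper that flags sensitive paths not covered by any Disallow rule (the helper name and matching logic here are illustrative, not the project's exact implementation):

// Illustrative check: which known-sensitive paths are not disallowed by any rule?
function findUnprotectedPaths(rules: RobotRule[]): { category: string; paths: string[] }[] {
  const disallowed = rules.flatMap(rule => rule.disallow).filter(Boolean);
  const isCovered = (path: string) =>
    disallowed.some(pattern => path.startsWith(pattern.replace(/\*$/, '')));

  return Object.entries(SENSITIVE_PATHS)
    .map(([category, paths]) => ({
      category,
      paths: paths.filter(path => !isCovered(path)),
    }))
    .filter(entry => entry.paths.length > 0);
}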
4. Comprehensive Scoring
The analyzer evaluates your robots.txt file against best practices and assigns a score:
export function analyzeRobotsTxt(rules: RobotRule[], baseUrl?: string): RobotsAnalysis {
  const globalRule = rules.find(rule => rule.userAgent === '*');
  const allSitemaps = new Set<string>();
  const recommendations: Recommendation[] = [];
  let score = 100;

  // Various checks that may reduce the score
  if (!globalRule) {
    recommendations.push({
      message: "Missing global rule (User-agent: *)",
      severity: "error",
      details: "A global rule provides default instructions for all crawlers and is considered a best practice."
    });
    score -= 20;
  }

  // Additional checks and scoring logic
  // ...

  return {
    summary: {
      totalRules: rules.length,
      hasGlobalRule: !!globalRule,
      totalSitemaps: allSitemaps.size,
      score,
      status: getStatusFromScore(score)
    },
    // Additional analysis results
  };
}
The score reflects how well your robots.txt file follows best practices, with deductions for issues like missing global rules, unprotected sensitive paths, or platform-specific concerns.
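The getStatusFromScore helper referenced in the summary maps the numeric score onto a readable status label. Something along these lines, with thresholds and labels chosen here purely for illustration:

// Illustrative thresholds; the project's actual buckets and labels may differ.
function getStatusFromScore(score: number): string {
  if (score >= 90) return 'excellent';
  if (score >= 70) return 'good';
  if (score >= 50) return 'needs improvement';
  return 'poor';
}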
The API Layer
The analyzer exposes a RESTful API that handles the analysis process directly within Cloudflare Pages Functions:
export const onPost: RequestHandler = async ({ json, parseBody, env, request }) => {
  // Authentication and validation

  // Normalize and process the URL
  const normalizedUrl = normalizeUrl(url);
  const domain = new URL(normalizedUrl).hostname;

  // Check cache for recent analyses

  // Fetch and analyze robots.txt
  const robotsUrl = `${new URL(normalizedUrl).origin}/robots.txt`;
  const response = await fetch(robotsUrl);
  const content = await response.text();
  const parsedRules = parseRobotsTxt(content);
  const analysis = analyzeRobotsTxt(parsedRules, normalizedUrl);

  // Cache results and return response
};
The API includes intelligent caching to improve performance and reduce load on target websites. Results are cached for 60 seconds to prevent unnecessary repeated analyses.
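The lookup itself can be a single keyed D1 query before the target site is fetched. A hedged sketch, reusing the cache table and timestamp column that appear in the cleanup endpoint in the next section; the domain and result columns are assumptions:

// Illustrative cache check; column names other than "timestamp" are assumptions.
async function getCachedAnalysis(db: D1Database, domain: string): Promise<unknown | null> {
  const cutoff = new Date(Date.now() - 60 * 1000).toISOString(); // 60-second cache window
  const row = await db.prepare(
    "SELECT result FROM cache WHERE domain = ? AND timestamp >= ? ORDER BY timestamp DESC LIMIT 1"
  ).bind(domain, cutoff).first<{ result: string }>();

  return row ? JSON.parse(row.result) : null;
}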
Maintenance and Cleanup
A simple Cloudflare Worker runs as a cron job to clean up old entries in the database. Looking at the robots-txt-cron repository, we can see the worker is quite straightforward:
export interface Env {
  API_KEY: string;
  BASE_URL: string;
}

export default {
  async fetch(request, env, ctx): Promise<Response> {
    const response = await fetch(`${env.BASE_URL}/api/v1/cleanup`, {
      headers: {
        'X-API-Key': env.API_KEY
      }
    });
    const result = await response.json();
    return new Response(JSON.stringify(result), { status: response.status });
  },
} satisfies ExportedHandler<Env>;
This worker simply calls the cleanup endpoint of the main application every 24 hours. The cleanup endpoint itself handles the actual database maintenance:
// Cleanup endpoint that removes old entries
export const onGet: RequestHandler = async ({ json, env }) => {
  try {
    const db = env.get("DB") as D1Database;
    const cutoff = new Date();
    cutoff.setHours(cutoff.getHours() - 24);

    // Delete cache entries older than 24 hours
    const result = await db.prepare(
      "DELETE FROM cache WHERE timestamp < ?"
    ).bind(cutoff.toISOString()).run();

    json(200, {
      success: true,
      deleted: result.meta.changes
    });
  } catch (error) {
    json(500, { error: 'Failed to clean up cache' });
  }
};
This automated maintenance helps keep the application running smoothly without manual intervention, ensuring that the database doesn't grow unnecessarily large with outdated analysis results.
User Experience Features
The analyzer includes several features to enhance the user experience:
- Instant Analysis: Enter a URL and get immediate feedback on your robots.txt file
- Detailed Recommendations: Actionable suggestions to improve your configuration
- Export Options: Download results in JSON or CSV format for further analysis or reporting (a small CSV sketch follows this list)
- History Tracking: View past analyses to track changes over time
- Mobile-Friendly Design: Works seamlessly on all devices
- Privacy-Focused Analytics: Uses self-hosted Umami analytics to respect user privacy while gathering usage insights
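As a small illustration of the export path mentioned above, recommendations could be flattened to CSV along these lines. The Recommendation fields match the analysis code earlier, while the helper itself is only a sketch:

// Illustrative CSV export of recommendations; the real export covers more of the analysis.
interface Recommendation {
  message: string;
  severity: string;
  details: string;
}

function recommendationsToCsv(recommendations: Recommendation[]): string {
  const escape = (value: string) => `"${value.replace(/"/g, '""')}"`;
  const header = 'severity,message,details';
  const rows = recommendations.map(r =>
    [r.severity, r.message, r.details].map(escape).join(',')
  );
  return [header, ...rows].join('\n');
}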
Real-World Applications
Robots.txt Analyzer serves several practical purposes:
SEO Optimization
By ensuring your robots.txt file correctly allows search engines to access important content while blocking unnecessary areas, you can improve your site's search visibility.
Security Enhancement
The analyzer identifies potential security risks where sensitive areas of your site might be exposed to crawlers, helping you protect administrative interfaces, login pages, and private content.
Technical Debugging
When crawling issues occur, the analyzer helps diagnose problems with your robots.txt configuration that might be preventing proper indexing.
Platform-Specific Guidance
For sites running on common platforms like WordPress, Drupal, or e-commerce systems, the analyzer provides tailored recommendations based on the specific requirements of those platforms.
Behind the Scenes: Cloudflare Integration
The analyzer leverages several Cloudflare technologies:
- Cloudflare Pages hosts the application with integrated serverless functions
- Cloudflare D1 stores analysis history and caches results
- Cloudflare KV manages user history and preferences
- Cloudflare Workers powers a small cron job for database maintenance
Together, these services keep the application fast, reliable, and scalable with minimal operational overhead.
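To give a feel for the KV piece, here is a hedged sketch of how per-user history could be stored; the key format, record shape, and retention choices are assumptions rather than the project's actual implementation:

// Illustrative KV usage for history; key format and record shape are assumptions.
interface HistoryEntry {
  domain: string;
  score: number;
  analyzedAt: string;
}

async function appendHistory(kv: KVNamespace, userId: string, entry: HistoryEntry): Promise<void> {
  const key = `history:${userId}`;
  const existing = (await kv.get<HistoryEntry[]>(key, 'json')) ?? [];
  const updated = [entry, ...existing].slice(0, 50); // keep only the most recent entries
  await kv.put(key, JSON.stringify(updated), { expirationTtl: 60 * 60 * 24 * 30 }); // ~30 days
}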
Looking Forward
The Robots.txt Analyzer started as a weekend project to solve a specific problem and grew into something more useful. It's a practical example of how specialized tools can simplify technical tasks that are often overlooked but still important.
The project combines modern web frameworks with serverless architecture to deliver a fast, responsive experience without the overhead of traditional hosting. Self-hosted Umami analytics provides usage insights while respecting visitor privacy.
The code is open source under the MIT License if you're curious about the technical details or want to contribute. If you just want to check your site's crawler configuration, you can find the tool at robots-txt.arvid.tech.