URL Crawling & Caching

Twitter’s crawler will respect robots.txt when scanning URLs. If a page with card markup is blocked, no card will be shown. If an image URL is blocked, no thumbnail or photo will be shown.

Twitter’s crawler identifies itself with the User-Agent Twitterbot (followed by a version, such as Twitterbot/1.0), which you can use to create an exception in your robots.txt file.

For example, here is a robots.txt which disallows crawling for all robots except Twitter’s fetcher:

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
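As a rough way to check rules like these before deploying them, the sketch below feeds the robots.txt above to Python's standard urllib.robotparser and asks whether two user agents may fetch a page. The example.com URL and the Googlebot/2.1 agent string are placeholders chosen for illustration, and Python's parser only approximates how Twitter's own crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the example above: allow Twitterbot, block everyone else.
ROBOTS_TXT = """\
User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url (per Python's parser)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article.html"  # hypothetical URL
    print("Twitterbot:", can_crawl("Twitterbot/1.0", page))
    print("Other bot:", can_crawl("Googlebot/2.1", page))
```

An empty Disallow line means "nothing is disallowed", so the first call should report that Twitterbot may fetch the page while the second agent falls through to the catch-all group and is blocked.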

Here is another example, which lets Twitterbot crawl only the images and archives directories and disallows everything else:

User-agent: Twitterbot
Disallow: /
Allow: /images
Allow: /archives

Your server’s robots.txt file must be saved as plain text with ASCII character encoding. To verify this, you can run the following command (the -I flag is the macOS spelling; with GNU file on Linux, use file -i instead):

$ file -I robots.txt
robots.txt: text/plain; charset=us-ascii
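If the file command is not available, the same check can be approximated in a few lines of Python. This is a minimal sketch: the helper name is_ascii_file is made up for this example, and it only tests that every byte is 7-bit ASCII, not that the file is otherwise a valid robots.txt.

```python
import tempfile
from pathlib import Path


def is_ascii_file(path) -> bool:
    """Return True if every byte in the file is in the 7-bit ASCII range."""
    return Path(path).read_bytes().isascii()


if __name__ == "__main__":
    # Write a throwaway robots.txt so the demo does not depend on local files.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "robots.txt"
        sample.write_bytes(b"User-agent: *\nDisallow: /\n")
        print(is_ascii_file(sample))
```

A file saved with a UTF-8 byte order mark or with curly quotes in a comment would fail this check, which is a common way for a robots.txt to end up non-ASCII.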

Your content is cached by Twitter for 7 days after a link to your page with card markup has been published in a tweet.

The example below uses a mix of Twitter and Open Graph tags to define a summary card with a large image:

<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@nytimesbits" />
<meta name="twitter:creator" content="@nickbilton" />
<meta property="og:url" content="website url" />
<meta property="og:title" content="A Twitter for My Sister" />
<meta property="og:description" content="In the early days, Twitter grew so quickly that it was almost impossible to add new features because engineers spent their time trying to keep the rocket ship from stalling." />
<meta property="og:image" content="image location url" />
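To see which card tags a page actually exposes, the markup can be read back with Python's standard html.parser. The sketch below is illustrative only: MetaTagCollector and collect_card_tags are names invented for this example, the sample markup is abbreviated from the tags above, and the code simply gathers both name= and property= meta tags without claiming to reproduce how Twitter's crawler resolves them.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect <meta> tags keyed by their name= or property= attribute."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("property")
        content = attributes.get("content")
        if key and content is not None:
            self.tags[key] = content.strip()


def collect_card_tags(html: str) -> dict:
    """Return a dict mapping each meta tag's name/property to its content."""
    parser = MetaTagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


# Abbreviated copy of the markup above; the URL values are placeholders.
SAMPLE = """
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@nytimesbits" />
<meta property="og:url" content="website url" />
<meta property="og:title" content="A Twitter for My Sister" />
<meta property="og:image" content="image location url" />
"""

if __name__ == "__main__":
    for key, value in collect_card_tags(SAMPLE).items():
        print(f"{key}: {value}")
```

Stripping whitespace from each content value guards against stray spaces around a URL, which are easy to introduce when pasting values into a template.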