How well web crawlers can capture your pages depends in part on whether you follow best development practices.
Here are a few things to think about.
The Internet Archive's crawler preserves web pages and sites, but to get the best results:
Delete or modify your robots.txt file to allow crawling. Test it with Google's robots.txt tester.
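The effect of a robots.txt file can also be checked programmatically. This sketch uses Python's standard-library parser on a hypothetical robots.txt that allows all crawlers; the user-agent string and URL are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an empty Disallow rule permits
# every crawler to fetch every path on the site.
robots_txt = """\
User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check whether a crawler may fetch a given page.
print(parser.can_fetch("ia_archiver", "https://example.org/page.html"))  # True
```

If your live robots.txt blocks a path, `can_fetch` returns False for URLs under it, which is a quick way to spot rules that would shut out an archiving crawler.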
Ensure every page and media element has a unique URL (avoid platforms such as Wix and Squarespace).
Avoid orphaned pages and link rot by maintaining stable URLs.
Avoid proprietary formats.
Include an XML Sitemap.
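A minimal XML sitemap simply lists each page's unique URL. The sketch below, using hypothetical URLs, generates one with Python's standard library:

```python
import xml.etree.ElementTree as ET

# Sitemap namespace defined by the Sitemaps protocol.
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical placeholder URLs; list every page on your site.
urls = ["https://example.org/", "https://example.org/about.html"]

urlset = ET.Element("urlset", xmlns=NS)
for u in urls:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = u

sitemap = ET.tostring(urlset, encoding="unicode")
print(sitemap)
```

Save the output as sitemap.xml at the site root so crawlers can discover every page, including ones not reachable by following links.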
Avoid architectures that rely on redirects, which most web crawlers do not handle well.
Architectures that depend on dynamic content are typically not captured by crawlers. For example, Scalar generates many URLs that contain ?path=. Crawlers do not capture elements that require user input; searches are inherently dynamic and require user interaction.
Crawlers can only crawl the publicly accessible web, so avoid password-protecting content unless you absolutely need to.
Using responsive design ensures that archive users will continue to have an experience comparable to the original website, regardless of the platform they use for access.
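One common building block of responsive design is the viewport meta tag, which lets a page adapt its layout to the accessing device (a minimal sketch):

```html
<!-- In the document <head>: scale the page to the device's width -->
<meta name="viewport" content="width=device-width, initial-scale=1">
```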
Providing equivalent text for non-textual content facilitates both search-crawler indexing and later full-text search in the archive. Here are some useful tools for checking accessibility:
Future-proof: to increase the chance that future browsers will be able to interpret today's code, validate against current web standards: http://validator.w3.org/
In addition to ensuring the page renders properly now, setting the character encoding in the HTTP header informs the browser of the character set being used, allowing successful capture and rendering of the archived copy and keeping the displayed text readable. See:
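For example, UTF-8 can be declared both in the server's HTTP response header and, as a fallback, in the page itself (a minimal sketch):

```html
<!-- HTTP response header sent by the server:
       Content-Type: text/html; charset=utf-8
     In-page fallback declaration, placed early in <head>: -->
<meta charset="utf-8">
```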
Social media feeds, calendars, and other 'infinite scrolling' gadgets can create a structural problem: crawlers discover a virtually infinite number of irrelevant URLs. In theory, a crawler can get stuck in one part of a website, never finishing with these irrelevant URLs and never reaching the content it needs to capture.
Because they are interactive, complex maps are typically poor candidates for web archiving. ArcGIS, StoryMaps, and Flash compositions are difficult for the Internet Archive to preserve.
Whenever possible, host media (multimedia, video, audio) content locally rather than embedding links. Alternatively, host media on the Internet Archive (archive.org) and embed the Internet Archive URL in your website.
Streaming media platforms (YouTube, Vimeo, and SoundCloud) are not built for long-term preservation. YouTube videos are easier to preserve with the Internet Archive crawler than Vimeo videos. Each YouTube video can appear only once on the entire site; if the same video is embedded twice, the crawler will capture neither instance. Vimeo embeds can be preserved with the Internet Archive, but only one Vimeo video can be embedded on each page.
WebRecorder preserves Scalar better than the Internet Archive does.
To archive searches, collect the URLs of popular search-result pages and add them to a page on your site. The Internet Archive crawler may then be able to capture these searches.
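Such a page can be as simple as a static list of links; in this hypothetical sketch, each link is a search-result URL that gives the crawler a stable entry point into an otherwise dynamic search:

```html
<!-- A hypothetical "archived searches" page; each link is a
     popular search-result URL the crawler can follow. -->
<ul>
  <li><a href="https://example.org/search?q=maps">Search results: maps</a></li>
  <li><a href="https://example.org/search?q=letters">Search results: letters</a></li>
</ul>
```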
Interactivity is not easily preserved by the Internet Archive. Build a static rather than a dynamic site, screen-record the interactive aspects of the site, and post the video on the site. WebRecorder captures interactivity better than the Internet Archive.
WebRecorder does not archive Tableau or Vega-Lite visualizations.
Is your site ArchiveReady?