<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Kakkoi &#187; proxy</title>
	<atom:link href="http://42.kaizeku.com/taxonomy/proxy//feed/" rel="self" type="application/rss+xml" />
	<link>http://42.kaizeku.com</link>
	<description>Web development, software, and Windows tips and tricks</description>
	<pubDate>Sat, 12 Jul 2008 15:10:01 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
	<language>en</language>
	<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>How to Track Google Proxy Hack Duplicate Content</title>
		<link>http://42.kaizeku.com/tips/how-to-track-google-proxy-hack-duplicate-contents/</link>
		<comments>http://42.kaizeku.com/tips/how-to-track-google-proxy-hack-duplicate-contents/#comments</comments>
		<pubDate>Fri, 01 Feb 2008 06:29:10 +0000</pubDate>
		<dc:creator>Noah Ark</dc:creator>
		
		<category><![CDATA[Blackhat]]></category>

		<category><![CDATA[Google Alerts]]></category>

		<category><![CDATA[Tips]]></category>

		<category><![CDATA[CopyScape]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[google alerts]]></category>

		<category><![CDATA[google-bug]]></category>

		<category><![CDATA[proxy]]></category>

		<category><![CDATA[proxy hack]]></category>

		<category><![CDATA[webscrapper]]></category>

		<guid isPermaLink="false">http://blog.kakkoi.net/tips/how-to-track-google-proxy-hack-duplicate-contents/</guid>
		<description><![CDATA[

I&#8217;m quite surprised by what I saw in my server logs today: someone decided to scrape my blog content (including my 100 MB+ WP translations cache).
The offending URI:
http://www.shouker.com/user1/baiheinet/2008/1/16/80897.html
I&#8217;ve blocked the site, but that won&#8217;t stop search engine crawlers from indexing the copied content.
This is a nasty black-hat SEO method that aims to get the target website penalized for [...]]]></description>
			<content:encoded><![CDATA[
<!-- google_ad_section_start -->
<p><img src='http://blog.kakkoi.net/wp-content/uploads/2007/12/marvin-apbot-costume-by-chaoskaizer.jpg' alt='Marvin Apbot costume by chaoskaizer' width="100" height="100" longdesc="http://gmodules.com/ig/proxy?url=http://blog.kakkoi.net/wp-content/uploads/2007/12/marvin-apbot-costume-by-chaoskaizer.jpg" />I&#8217;m quite surprised by what I saw in my server logs today: someone decided to scrape my blog content (including my 100 MB+ WP translations cache).</p>
<pre>The Offending uri:
http://www.shouker.com/user1/baiheinet/2008/1/16/80897.html</pre>
<p>I&#8217;ve blocked the site, but that won&#8217;t stop search engine crawlers from indexing the copied content.</p>
<p>This is a nasty black-hat SEO method that aims to get the target website penalized for duplicate content on the major search engines. Here are a few solutions I found in various resources &darr;.<br />
<span id="more-167"></span></p>
<ul>
<li>Report it to Google at <dfn title="google proxy hack report">proxyreports@gmail.com</dfn>, providing the URL &#038; the Google search query.</li>
<li>Block the proxy&#8217;s referrer or IP address.</li>
<li>Add a noindex robots meta tag for unknown search engine spiders:
<pre>&lt;META NAME=&quot;ROBOTS&quot; CONTENT=&quot;NOARCHIVE, NOINDEX, NOFOLLOW&quot;&gt;</pre>
</li>
</ul>
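<p>The second option can be sketched in .htaccess. This is only a minimal example under a few assumptions: it uses Apache&#8217;s <tt>Order</tt>/<tt>Allow</tt>/<tt>Deny</tt> access directives, the variable name <tt>scraper_site</tt> is made up, and <tt>192.0.2.0/24</tt> is a placeholder range that has to be replaced with the addresses seen in your own logs. Matching on the <tt>Referer</tt> header only works if the proxy actually sends one.</p>
<pre class="prebox">
&lt;IfModule mod_setenvif.c&gt;
# Flag requests whose Referer header points at the scraping site.
SetEnvIfNoCase Referer &quot;shouker\.com&quot; scraper_site=1
&lt;/IfModule&gt;
# Allow everyone, then deny the flagged requests and the placeholder range.
Order Allow,Deny
Allow from all
Deny from env=scraper_site
Deny from 192.0.2.0/24
</pre>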
<h2>How to Track Google Proxy Hack Duplicate Content</h2>
<ol>
<li>Monitor your content with <a class="exturl icn-r" href="http://www.google.com/alerts">Google Alerts</a>. Try to use unique <em>search terms</em> for your website, e.g. blog.kakkoi, myname, my unique keywords, the URL http://blog.kakkoi.net, or the URL-safe base64 encoding of that URL.<br />
If you have a Google Webmaster account, go to <em>Statistics &raquo; What Googlebot sees</em> and use the keywords listed there as your Google Alerts search terms.
</li>
<li>Search for copies of your page on the web with <a href="http://www.copyscape.com/" class="exturl icn-r">Copyscape</a>.</li>
</ol>
<h2>Whitelisting Search Engine Crawlers</h2>
<p>IMO, blocking the IP ranges of proxy servers is not very practical. Keeping a whitelist of search engine crawler IPs (class C ranges) might do the trick. I&#8217;m working on a script that whitelists search engine crawlers for my WordPress install. Hopefully I can finish it later this week. </p>
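<p>The whitelist idea could be roughed out in .htaccess along these lines. This is a sketch, not the script mentioned above: it assumes mod_setenvif and mod_headers are both available, the variable name <tt>known_crawler</tt> is made up, and <tt>192.0.2.0/24</tt> is a placeholder that must be replaced with crawler ranges you have verified yourself. Every request from outside the whitelist gets a noindex <tt>X-Robots-Tag</tt> header, which only helps if the proxy passes response headers through unchanged. Test it carefully, since a wrong range would tell real crawlers not to index the site.</p>
<pre class="prebox">
&lt;IfModule mod_setenvif.c&gt;
# Mark requests coming from a whitelisted crawler range (placeholder /24).
SetEnvIf Remote_Addr &quot;^192\.0\.2\.&quot; known_crawler=1
&lt;IfModule mod_headers.c&gt;
# Everyone outside the whitelist is told not to index or archive the response.
Header set X-Robots-Tag &quot;noindex, noarchive, nofollow&quot; env=!known_crawler
&lt;/IfModule&gt;
&lt;/IfModule&gt;
</pre>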
<h2>Google Algo bugs</h2>
<p><span class="vcard"><a href="http://www.seofaststart.com/" class="url fn microformat icn-l">Dan Thies</a></span> at seofaststart.com posted a detailed analysis of this issue; check out his post &rarr; <a class="exturl icn-r" href="http://www.seofaststart.com/blog/google-proxy-hacking">Google Proxy Hacking: How A Third Party Can Remove Your Site From Google SERPs</a>.</p>
<h2>Recent Update</h2>
<ul>
<li class="cf">Caught the proxy user just after I published this article. It&#8217;s a human: <em>117.8.222.77 / c-net 117.8.0.0/13</em> from Tianjin, China.<br />
<a href='http://blog.kakkoi.net/wp-content/uploads/2008/02/shouker-proxy.png' title='shouker-proxy.png' type="image/png"><img src='/wp-content/uploads/2008/02/shouker-proxy.thumbnail.png' alt='shouker.com proxy user' width='128' height='41' longdesc='http://gmodules.com/ig/proxy?url=http://blog.kakkoi.net/wp-content/uploads/2008/02/shouker-proxy.png' /></a></li>
<li>The IP was graylisted on RBL &#038; cml.anti-spam.org.cn, so we sent a letter to abuse@cnc-noc.net.</li>
</ul>
<!-- google_ad_section_end -->
]]></content:encoded>
			<wfw:commentRss>http://42.kaizeku.com/tips/how-to-track-google-proxy-hack-duplicate-contents/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How to block Google Wireless Transcoder</title>
		<link>http://42.kaizeku.com/tips/how-to-block-google-wireless-transcoder-gwt-googlebot-mobile/</link>
		<comments>http://42.kaizeku.com/tips/how-to-block-google-wireless-transcoder-gwt-googlebot-mobile/#comments</comments>
		<pubDate>Sat, 29 Dec 2007 07:11:19 +0000</pubDate>
		<dc:creator>Noah Ark</dc:creator>
		
		<category><![CDATA[GWT]]></category>

		<category><![CDATA[Google Proxy]]></category>

		<category><![CDATA[Google-mobile]]></category>

		<category><![CDATA[Tips]]></category>

		<category><![CDATA[Google]]></category>

		<category><![CDATA[Google Wireless Transcoder]]></category>

		<category><![CDATA[googlebot-mobile]]></category>

		<category><![CDATA[htaccess]]></category>

		<category><![CDATA[proxy]]></category>

		<guid isPermaLink="false">http://blog.kakkoi.net/tips/how-to-block-google-wireless-transcoder-gwt-googlebot-mobile/</guid>
		<description><![CDATA[

When Google Wireless Transcoder (GWT, Googlebot-Mobile) transcodes your website, it strips all &#8220;scripts&#8221; and renders the page in a mobile format (XHTML Mobile 1.0), Google&#8217;s version of a &#8220;mobile format&#8221;. To test this service, go to http://google.com/gwt/n. GWT is made for mobile users, but you can still browse through it with a normal browser.
So what the heck is wrong with it?
The answer [...]]]></description>
			<content:encoded><![CDATA[
<!-- google_ad_section_start -->
<p><img src='http://blog.kakkoi.net/wp-content/uploads/2007/12/google_mobile.gif' alt='google_mobile.gif' width="70" height="75" class="fl" />When <strong class="fw-">Google Wireless Transcoder </strong>(GWT, Googlebot-Mobile) transcodes your website, it strips all &#8220;scripts&#8221; and renders the page in a mobile format (XHTML Mobile 1.0), <span class="td-l">Google&#8217;s version of a &#8220;mobile format&#8221;</span>. To test this service, go to <a href="http://google.com/gwt/n" class="external icn-r exturl">http://google.com/gwt/n</a>. GWT is made for mobile users, but you can still browse through it with a normal browser.</p>
<h2>So what the heck is wrong with it?</h2>
<p>The answer is yes &#038; no. This type of service is bad for webmasters who depend on ad income. Normal surfers, on the other hand, would love it, as they won&#8217;t need to view any ads and can surf safely without embedded JavaScript (from the originating website).<br />
<span id="more-113"></span></p>
<h2>How to Block Googlebot Mobile Crawler</h2>
<p>These are some of the server environment variables seen on a request from <strong>Google Wireless Transcoder</strong>:</p>
<dl id="GWT" class="profile">
<dt>USER_AGENT</dt>
<dd>Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Google Wireless Transcoder;)</dd>
<dt>HTTP_VIA</dt>
<dd>1.1 proxy.google.com:80 (squid)</dd>
<dt>HTTP_X_FORWARDED_FOR</dt>
<dd>xxx.xx.xxx.xxx, unknown</dd>
<dt>REMOTE_ADDR</dt>
<dd>209.85.138.136</dd>
<dt>REMOTE_PORT</dt>
<dd>56931</dd>
</dl>
<h2>Block via .htaccess</h2>
<p>With <a href="http://httpd.apache.org/docs/1.3/mod/mod_setenvif.html" rel="external">mod_setenvif</a>:</p>
<pre class="prebox">
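&lt;IfModule mod_setenvif.c&gt;
# Flag the request when the User-Agent contains either token.
# No ^ anchor: the tokens are not at the start of the User-Agent string.
SetEnvIfNoCase User-Agent &quot;Google Wireless Transcoder&quot; gwt_agent=1
SetEnvIfNoCase User-Agent &quot;Googlebot-Mobile&quot; gwt_agent=1
&lt;FilesMatch &quot;(.*)&quot;&gt;
# Allow everyone, then deny flagged requests (Deny wins under Allow,Deny).
Order Allow,Deny
Allow from all
Deny from env=gwt_agent
&lt;/FilesMatch&gt;
&lt;/IfModule&gt;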
</pre>
<p>Or with <a href="http://httpd.apache.org/docs/1.3/mod/mod_rewrite.html" rel="external">mod_rewrite</a>:</p>
<pre class="prebox">
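&lt;IfModule mod_rewrite.c&gt;
RewriteEngine On
# Match the tokens anywhere in the User-Agent, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} Google\ Wireless\ Transcoder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Googlebot-Mobile [NC]
# Answer every matching request with 403 Forbidden.
RewriteRule ^.* - [F,L]
&lt;/IfModule&gt;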
</pre>
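<p>The <tt>HTTP_VIA</tt> value listed above suggests one more signal to match on. A minimal sketch, assuming the transcoder keeps sending <tt>proxy.google.com</tt> in its <tt>Via</tt> header and reusing the <tt>gwt_agent</tt> variable name from the mod_setenvif example:</p>
<pre class="prebox">
&lt;IfModule mod_setenvif.c&gt;
# Flag requests relayed through the Google proxy, whatever the User-Agent says.
SetEnvIfNoCase Via &quot;proxy\.google\.com&quot; gwt_agent=1
Order Allow,Deny
Allow from all
Deny from env=gwt_agent
&lt;/IfModule&gt;
</pre>
<p>Other Google services may use the same proxy host, so this rule can block more than the transcoder.</p>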
<h2>Robots Exclusion Standards</h2>
<pre class="prebox">
User-agent: Googlebot-Mobile
Disallow: /
</pre>
<h3>Google Webmaster Analyze robots.txt</h3>
<p>After you add the above robots.txt rules, log in to <a href="http://www.google.com/webmasters/tools/" class="google icn-l1">Google Webmaster Central</a>. </p>
<ol class="cb">
<li>Select Tools &gt; Analyze robots.txt </li>
<li>Select <tt class="di">Google Mobile : Crawls page for our mobile index</tt> in the &#8220;user-agents&#8221; dropdown list.</li>
</ol>
<p><img src='http://gmodules.com/ig/proxy?url=http://blog.kakkoi.net/wp-content/uploads/2007/12/google-webmaster-tools-analyze-robotstxt.png' alt='google-webmaster-tools-analyze-robotstxt.png' /></p>
<h2 class="cb">Embed an HTML Link Element in the Header</h2>
<pre class="smallbox">
&lt;link rel=&quot;alternate&quot; media=&quot;handheld&quot; href=&quot;http://changethis-url-for-mobile-user&quot; /&gt;
</pre>
<h2>Google Support</h2>
<p>If you want to prevent Google Mobile services from transcoding your pages, it&#8217;s recommended to request removal via the <a href="http://www.google.com/support/mobile/bin/request.py?contact_type=googlebot">Google Mobile Support</a> form. </p>
<h2>Soap</h2>
<p>If Google Mobile could restrict this service to mobile-only viewing, or maybe implement something like &#8220;<a href="http://www.duggtrends.com">duggmirror</a>&#8221; for normal browsing, it would be welcome. </p>
<!-- google_ad_section_end -->
]]></content:encoded>
			<wfw:commentRss>http://42.kaizeku.com/tips/how-to-block-google-wireless-transcoder-gwt-googlebot-mobile/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
