Against counterfeiting: web scraping comes to the rescue
The Organisation for Economic Co-operation and Development (OECD) estimates that counterfeit and pirated goods account for as much as 6.8% of all EU imports. COVID-19 has greatly accelerated the shift towards e-commerce and social media, and counterfeiters are benefitting from the rise in digital channels, as there are more places to list items online.
The OECD has been keeping a close eye on the impact of counterfeiting and initially estimated that counterfeit goods made up 3-5% of global trade between 1990 and 1998. Since 1998, the share of counterfeit goods in world trade has fluctuated, falling to 2.5% and climbing back up to 3.3% in 2016. In 2016, global exports were worth $16 trillion; 3.3% of that is $528 billion.
Beyond the outrageous economic impact, there are other costs involved. For fashion and luxury goods, this means loss of revenue and, potentially, of loyal customers. Red Points research indicates that 11% of all counterfeit goods transactions would otherwise have been real purchases. Low-quality goods typically become useless much sooner than legitimate products, and customers unaware of the fraud might be more likely to turn to competitors in search of better quality.
The counterfeit goods trade includes apparel, copyrighted works (such as movies or video games), chemical products, consumer electronics, personal accessories, and numerous other categories that can directly impact the consumer (e.g. pharmaceuticals). Unsuspecting consumers might receive an inferior product that contains smaller amounts of active ingredients, causing undue harm to their health. Additionally, counterfeit pharmaceuticals often undergo little to no quality control, which could affect the resulting product and produce more side effects.
Finally, there are impacts on governments and countries. Seizures of counterfeit goods at customs have a large financial impact and might divert attention from even more pressing matters (such as arms trafficking). Additionally, governments suffer significant tax revenue losses and incur high costs each year for intellectual property infringement trials, as businesses bring numerous cases against counterfeiters.
How web crawling helps businesses
In most countries, governments can rarely take action against counterfeit goods on their own, other than seizing them at customs and prohibiting their resale. Thus, businesses are often left to their own devices to protect their intellectual property.
Businesses have to do a lot of legwork before they can build a legal case. Even finding counterfeit goods can be difficult without the correct tools. However, advancements in public online data acquisition have made finding sellers of counterfeit goods a lot easier. With the power of web crawling, both finding illegitimate listings and retrieving evidence can be automated.
What is web crawling?
Web crawling is the process of automated goal-based online data retrieval, where sophisticated bots browse the internet and download the data requested. These applications are commonly developed with high-level programming languages that have useful packages for web crawling. A great example is Python. There are libraries for browser and HTTP method automation that serve as great foundations for web crawlers.
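To make the idea concrete, here is a minimal sketch of a crawler built only on Python's standard library. It is an illustration under simple assumptions rather than a production design: the names `LinkParser`, `fetch_html` and `crawl` are made up for this example, the crawler stays on a single host, and it ignores concerns such as rate limiting and robots.txt that a real deployment would have to handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch_html(url):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=20):
    """Breadth-first crawl that stays on the start URL's host.

    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in pages:
            continue  # already downloaded via another link
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in pages:
                queue.append(absolute)
    return pages
```

The `fetch` parameter is there so the download step can be swapped out, for example for a browser-automation library or for a stub when testing.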
Numerous applications for web crawlers exist. Wherever large amounts of data might be useful, these automated extraction tools may be applied. They are now widely used by e-commerce, marketing and SaaS companies for various purposes.
However, few companies develop their own in-house web crawlers, as maintaining them is fraught with challenges. At the very least, maintenance is costly due to the volatile nature of internet pages. Changes to website layouts will most often require code rewrites to keep operations running as usual.
Additionally, most websites attempt to protect themselves from any type of bot traffic. They often do not care to discern between bots with good or bad intentions and apply blanket solutions to both categories. Therefore, effective web crawling needs a way to change the source of connection requests frequently. Proxies are commonly employed as the primary way to continually change IP addresses in order to maintain access to the websites in question.
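One simple way to rotate connection sources is to cycle through a pool of proxies, one per request. The sketch below assumes a fixed list of proxy addresses; the `proxy-N.example` endpoints and the `RotatingFetcher` class are hypothetical placeholders, and commercial proxy services typically offer their own rotation mechanisms instead.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy endpoints; a real pool would come from a proxy provider.
PROXY_POOL = [
    "http://proxy-1.example:8080",
    "http://proxy-2.example:8080",
    "http://proxy-3.example:8080",
]


class RotatingFetcher:
    """Sends each request through the next proxy in a fixed pool."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy address in round-robin order."""
        return next(self._proxies)

    def fetch(self, url, timeout=10):
        """Download url through the next proxy and return the body as text."""
        proxy = self.next_proxy()
        # Route both plain and TLS traffic through the selected proxy.
        opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
        with opener.open(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
```

Round-robin is the simplest possible policy; picking proxies at random or retiring ones that get blocked are common refinements.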
Crawling to protect
Web crawlers can browse and retrieve data from thousands or even tens of thousands of pages per second and may be used to monitor e-commerce platforms, social media, and other websites where counterfeit products may be sold. Crawlers do not even need to go through each and every listing.
According to research by Red Points, most counterfeiters will use relevant descriptive keywords. For example, instead of using the San Diego designer glasses Knockabout brand name, they might list the product as “San Diego sunglasses” and use a clear image with the brand name.
Other common keywords include “like original”, “AAA quality” and the like. Of course, these often give away that the product being sold is not genuine, making it easier for web crawlers and consumers to spot these listings. Unfortunately, some consumers still opt to buy counterfeit goods.
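As a rough illustration of how a crawler could act on such keywords, the sketch below flags scraped listings whose text contains one of the giveaway phrases. The two phrases come from the paragraph above; the listing fields (`url`, `title`, `description`) and the function names are assumptions made for this example, and a real phrase list would be longer and tailored to the brand.

```python
import re

# Phrases taken from the text above; a real list would be brand-specific.
SUSPICIOUS_PHRASES = ["like original", "AAA quality"]


def build_pattern(phrases):
    """Compile one case-insensitive regex that matches any phrase,
    tolerating any amount of whitespace between its words."""
    alternatives = [
        r"\s+".join(re.escape(word) for word in phrase.split())
        for phrase in phrases
    ]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def flag_listings(listings, phrases=SUSPICIOUS_PHRASES):
    """Return (url, matched phrases) for every listing that contains
    at least one suspicious phrase in its title or description."""
    pattern = build_pattern(phrases)
    flagged = []
    for listing in listings:
        text = f"{listing['title']} {listing.get('description', '')}"
        # Normalise case and spacing so each phrase is reported once.
        hits = sorted({" ".join(m.lower().split()) for m in pattern.findall(text)})
        if hits:
            flagged.append((listing["url"], hits))
    return flagged
```

A keyword match is only a signal that a listing deserves a closer look, not proof that the product is counterfeit.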
By developing (or buying) a web crawler, businesses can not only maintain a strong brand presence online but also profit from preventing counterfeit goods sales by directing customers to original products.
There are three primary web crawler deployment methods for brand protection, and each of them has to be employed to maximise the probability of finding counterfeit goods listings:
- Regular search engine scraping: Finding listings on search engines is quite simple once a web crawler is up and running. All that is needed is a large collection of known keywords used to list counterfeit (and original) products. Web crawlers would then automatically return the search engines’ results that may lead to counterfeit listings on certain websites.
- Crawling e-commerce platforms: Another option is to crawl known marketplaces. E-commerce websites often have a lot of categories for products, making it a lot less resource-intensive to find and monitor listings. As even counterfeiters will use the correct taxonomy to list their products, creating and maintaining a keyword list will not be necessary (though it might be helpful).
- Reverse image search: Counterfeit products can be found using reverse image search. Although these web crawlers would be slightly different, they would feed images of legitimate products into the search engine and scrape the URL results to see where copies of those images are being used online. Counterfeiters will often use copies of the original product image or a very similar one in their product listings.
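For the first method in the list above, the keyword collection can be expanded into search queries mechanically. The sketch below pairs product terms with giveaway modifiers and turns each query into a URL; `SEARCH_ENDPOINT` and its `q` parameter are hypothetical stand-ins, since every search engine or scraping API has its own URL format.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the search engine or API actually used.
SEARCH_ENDPOINT = "https://search.example/search"


def build_queries(product_terms, modifiers):
    """Return each product term alone, followed by every
    'term modifier' combination."""
    queries = list(product_terms)
    queries += [
        f"{term} {modifier}"
        for term, modifier in product(product_terms, modifiers)
    ]
    return queries


def search_urls(queries, endpoint=SEARCH_ENDPOINT):
    """Encode each query as a URL for the given search endpoint."""
    return [f"{endpoint}?{urlencode({'q': query})}" for query in queries]
```

The resulting URLs would then be handed to the crawler, which scrapes the result pages for links to candidate listings.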
All these search queries are just a small part of the process of catching counterfeit product providers. Companies use web crawling to gather evidence for intellectual property protection and to regain hard-won market share and profits.
Many companies use third-party providers for brand protection. These companies usually only gather data on the use of the intellectual property and deliver insights to their clients. However, the process of gathering evidence is not as difficult if the required tools are readily available. It is within the realm of possibility that businesses could create in-house teams that would keep an eye on counterfeiters.
After all, if the data from Red Points is correct, 11% of all counterfeit goods transactions would otherwise have been real sales. As I see it, creating and maintaining in-house web data monitoring systems and teams would cost only a fraction of these lost sales.