Train Custom Models for Enhanced SharePoint Syntex Experience

Creating an Extractor for Buyer Name in Invoices

As a data scientist, I was recently tasked with building an extractor that pulls the buyer name from invoices generated by Tech Solution Company DOO. The invoices contain a variety of information, including the buyer’s name, identity number, and the date of turnover of services. To accomplish this, I tried several techniques, which I outline in this blog post.

Challenges Faced

One of the main challenges I faced when attempting to extract the buyer name was that the name was not always present in the same location within the invoices. Additionally, the name was often preceded by other information, such as the company name and identity number, which made it difficult to isolate the buyer’s name accurately.

Approach Taken

To overcome these challenges, I combined a regular expression with the Explanation feature in the Extractor tool. By adding an After label (Buyer:) to the Explanation, I told the extractor to capture only text that appears after the “Buyer:” label. This allowed me to exclude the company name and identity number from the extracted data.

Here is the regular expression pattern that I used to extract the buyer name: `\b(Tech Solution Company DOO)\b`

This pattern uses word boundaries (`\b`) so that the phrase “Tech Solution Company DOO” matches only as whole words, not when it is embedded in a longer word (for example, “DOOR”). The parentheses form a capturing group around the phrase, which allows me to refer to it later in the Explanation. Because the pattern is a fixed literal, it only matches this one buyer name.
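To see how the word boundaries behave, here is a minimal Python sketch that applies the same pattern with the standard `re` module. The helper name `find_company` and the sample lines are hypothetical, made up for illustration, and Python's regex engine may not behave identically to the one inside the Extractor tool.

```python
import re

# Pattern quoted from the post: a literal phrase wrapped in word boundaries.
PATTERN = re.compile(r"\b(Tech Solution Company DOO)\b")

def find_company(text):
    """Return the captured phrase, or None when there is no whole-word match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical sample lines, not taken from a real invoice.
samples = [
    "Buyer: Tech Solution Company DOO, ID 12345",  # whole-word match
    "Buyer: Tech Solution Company DOOR Ltd",       # "DOO" is part of "DOOR"
    "Buyer: HiTech Solution Company DOO",          # "Tech" is part of "HiTech"
]

for line in samples:
    print(line, "->", find_company(line))
```

Only the first sample produces a match. In the other two, the phrase is glued to extra letters, so one of the `\b` anchors fails.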

Solution Implemented

To implement the solution, I created a new Extractor and specified the regular expression pattern above as the extraction rule. I also added the After label (Buyer:) to the Explanation, as described earlier.
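The effect of pairing an After label with the pattern can be approximated outside the tool. The Python sketch below is only an analogy under my own assumptions: it is not how the Extractor applies Explanations internally, and the function `extract_after_label` and the sample texts are hypothetical.

```python
import re

LABEL = "Buyer:"  # the After label named in the post
NAME = re.compile(r"\b(Tech Solution Company DOO)\b")

def extract_after_label(text, label=LABEL):
    """Search for the pattern only in the text that follows the label."""
    _, found, tail = text.partition(label)
    if not found:
        return None  # label missing: nothing to anchor on
    match = NAME.search(tail)
    return match.group(1) if match else None

# Hypothetical invoice snippets.
with_label = "Invoice 7\nBuyer: Tech Solution Company DOO, ID 998877\n"
name_before_label = "Tech Solution Company DOO\nBuyer: Another Firm\n"

print(extract_after_label(with_label))         # name found after the label
print(extract_after_label(name_before_label))  # name only before the label
```

Restricting the search to the text after the label is what keeps an occurrence elsewhere on the page, such as in a header, from being picked up.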

When testing the extractor with a sample invoice, it successfully extracted only the buyer’s name (Tech Solution Company DOO) from the text, ignoring the other information present in the invoice. This demonstrated that the approach I took was effective and enabled me to achieve my goal of extracting the buyer name from the invoices.

Lessons Learned

Throughout this project, I learned several valuable lessons that can be applied to future data extraction tasks:

1. Use regular expressions carefully: Regular expressions can be very powerful for extracting specific text patterns, but they can also be difficult to use correctly. It’s essential to test the regular expression pattern thoroughly before using it in a production environment.

2. Experiment with different techniques: There may be multiple ways to extract the same data, and each method may have its advantages and disadvantages. By exploring different approaches, you can choose the one that best fits your needs and requirements.

3. Use the Explanation feature: The Explanation feature in the Extractor tool is incredibly useful for specifying how the extractor should behave. By adding an After label (Buyer:) to the Explanation, I was able to ensure that only the buyer’s name was extracted, rather than any other text.

4. Test and refine the extractor: Before using an extractor in a production environment, it’s essential to test it thoroughly to ensure that it is working correctly. This may involve testing the extractor with different types of data or adjusting the regular expression pattern as needed.
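Lessons 1 and 4 can be turned into a small, repeatable check. The following Python sketch runs a pattern against a table of expected results and lists the mismatches. The cases are hypothetical, and the `extract` function stands in for the real extractor, which I am assuming behaves like a plain regex search.

```python
import re

def extract(text):
    """Stand-in for the extractor: a plain search with the post's pattern."""
    m = re.search(r"\b(Tech Solution Company DOO)\b", text)
    return m.group(1) if m else None

# (input text, expected extraction) pairs; all hypothetical.
CASES = [
    ("Buyer: Tech Solution Company DOO", "Tech Solution Company DOO"),
    ("Buyer: Tech Solution Company DOOR", None),
    # Double space between words, as can happen in scanned invoices.
    ("Buyer: Tech  Solution Company DOO", "Tech Solution Company DOO"),
]

def run_cases(extractor, cases):
    """Return the cases where the extractor's output differs from the expectation."""
    failures = []
    for text, expected in cases:
        actual = extractor(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

failures = run_cases(extract, CASES)
for text, expected, actual in failures:
    print(f"FAILED: {text!r}: expected {expected!r}, got {actual!r}")
```

With these cases, the double-space variant is reported as a failure, which is the kind of signal that suggests relaxing the pattern (for instance, allowing `\s+` between words) before relying on it in production.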

Conclusion

In this blog post, I described how I built an extractor for the buyer name in invoices generated by Tech Solution Company DOO. By combining a regular expression with the Explanation feature, I was able to extract only the buyer’s name and ignore the other information in the invoice. Along the way, I learned valuable lessons about using regular expressions carefully, experimenting with different techniques, and testing and refining the extractor before using it in a production environment.